IV. Integrating Holistic Assessment into a Writing Program
Holistic assessment works best as a linchpin for improving legal writing when an organization is oriented toward its basic principles in other parts of the legal writing process. As a starting point, all writers should know the content of the rubric and have opportunities to assess and analyze decisions based on its components. This exercise places writers in the position of readers. Since most administrative law organizations publish administrative decisions on the Internet, these discussions can extend to how the rubric supports readability for a wider audience.
The holistic rubric can also support the process and discussion behind organizational reviews of drafts, which give the writer feedback before a decision is published. Peers and supervisors can conduct these reviews. Although it is helpful for reviewers to use the language of the rubric when providing feedback, that feedback should not include a holistic score or any final quality judgment about the decision. Such scores would be neither reliable nor consistent, because they are not the product of a calibration exercise or of more than one reader. They would also tend to advance the writer too hastily through the draft-to-publication process.22 Holistic assessment is most effective during high-stakes sessions that are defined by the organization and preceded by training.23 High-stakes sessions in our organization, for example, are assessment exercises that affect awards and performance evaluations or sessions that are designed to identify model decisions for future assessments.
One advantage an administrative agency may have over law schools or individual legal writing courses is that its employees stay with the organization and the writing environment for a significant period of time. Longevity provides an opportunity for the organization to conduct periodic writing quality checks each year to assess whether writers in the organization consistently recognize strong and weak writing. The Internet and e-mail provide convenient and low-cost methods to distribute a decision, collect scores, and inform the organization about the results. At our agency, most people find these calibration sessions conducted during the year more informative and interesting than actual high-stakes assessments. Because of the efficiency of these periodic calibration sessions, we have been able to assess over 300 decisions per high-stakes assessment.
These periodic quality checks help speed up calibration when the organization intends to conduct a holistic assessment for high-stakes writing. The pool of evaluators can come from the same pool as the writers in the organization. With proper statistical control and management, we have seen the same levels of reliability as with holistic assessment results in other disciplines. For example, at one of our agency national training conferences, we conducted an analytic writing exercise with members of a nationally recognized testing service. This session validated our rubric and the ability of our own employees to score essays and administrative decisions with the same statistical reliability as the consultants.
A. Metrics for Administrative Decisions
As stated previously, one of the advantages that holistic assessment provides for large legal organizations is that systematic implementation can report valid and reliable data on the writing strengths and weaknesses of an organization. Any discussion about holistic assessment metrics, however, must begin with a brief description of holistic assessment as a psychometric evaluation instrument.
Classified as a form of direct assessment, holistic assessment gains support from linguistic perspectives such as reader response, semiotics, and other views that connect meaning to a process view of language. From most of these perspectives, the act of reading and writing both construct meaning. Further, authorial intention dissipates as the reader becomes more prominent. Proponents of direct assessment focus on the cognitive processes of writing, including the social and linguistic contexts in which it occurs. Reader and writer become more like information-processing mechanisms, processing a complex set of semiotic cues. The assessment of a text, therefore, is an assessment of the interplay between the reader, the writer, and the linguistic and social context of the discourse.24
Direct assessments can be highly contextualized. For writing assessments, the context may come from a prompt or some other impetus that causes the writing. The prompt means that a direct assessment occurs under controlled conditions. Since there is not one right answer to a test question or prompt, direct assessments measure divergent knowledge. In assessing texts, readers evaluate and compare samples produced under the same conditions. Through a rubric, and after training or instruction, readers consider the complete systematic conditions that produce a text and how various elements work together to affect quality. The strength of holistic assessment is that it calls upon the full array of writing skills that reflect real world writing conditions.
As do other direct assessments, holistic assessment relies heavily upon a modern linguistic understanding of the reader. From this theoretical perspective, the reader is no longer a biographical person, but the name of a place where semiotic codes are located and processed. The reader becomes a function that processes signs and enables them to have meaning. Holistic assessment, therefore, applies the perspective of semiotic inquiry. Semiotic inquiry describes how achieving a system of conventions is responsible for meaning;25 holistic assessment describes how well those systems of conventions achieve meaning for the reader. If it is true that the reader becomes the repository for the processing of codes that account for the intelligibility of a text, then holistic assessment is taking the pulse of quality at a key place in the process.
In using the rubric, holistic assessment also adopts some principles of cognitive processes from the early-twentieth-century Gestalt school of psychology. Holistic assessment applies these psychological principles to the readers of texts. According to these principles, cognitive processes are not additive or elemental. Instead, cognition perceives a phenomenon as greater than the sum of its parts.26 Based upon these cognitive principles, proponents argue that it is possible for readers in a holistic assessment to rank writing samples if the samples are produced under controlled conditions. Further, readers can identify similar characteristics of papers and agree upon the value of these characteristics for any given assessment. After training exercises, readers are able to agree on adjacent scores for individual essays. And finally, readers accomplish all of the above while weighing the relationships among hypothetical standards, the total effect and impression of the writing sample, and the varying social-linguistic conditions of the testing environment. Moreover, writers know how their products will be evaluated, who will be the audience, and what will be the contextual conditions of the assessment.
Holistic assessment and other direct assessments bring a specific approach to the issue of context in writing assessment to achieve valid and reliable test results. The direct assessment approach varies slightly from the approach used in indirect assessments. Generally, objective empiricism is a presumed strength of indirect assessment, because proponents claim to eliminate context, therefore ensuring that the results of assessments are objective evidence of writing ability. On the other hand, direct assessments, like holistic assessment, acknowledge and manipulate the context, through a prompt, to trigger the semiotic mechanisms that produce a writing sample.27 It is not surprising, therefore, that holistic assessment continuously strives to address concerns about the objectivity, reliability, and validity of its results. Proponents are trying to show that the semiotic mechanisms employed by readers and writers in holistic assessment behave consistently and predictably.
The main statistical challenges for holistic assessment are predictive validity and instrument and inter-rater reliability. How well the test predicts future writing success, analysis of the holistic prompt, and agreement among readers are all areas of inquiry that address these challenges. Validity and reliability in holistic assessment provide a rich canvas for academics to discuss statistics, testing conditions, and the reporting of results. This discussion can become quite complex even though complex statistical analysis is usually not a trait associated with writing practitioners. As a practical matter, however, a good writing program that uses holistic assessment must collect data that reports information in the following areas: 1) the test must predictably assess writing results over time; 2) the writing program must analyze whether different sets of readers consistently score essays produced under similar conditions; and 3) the program must analyze how consistently readers agree upon individual scores for essays. These statistics take into account all sources of measurement error — the writer, the test, and the scoring protocols.28
B. Reader Protocols and Training Sessions
In line with results for the writing industry, we have found that an administrative law organization can achieve statistically valid and reliable results with holistic assessment. These results depend, however, upon the proper implementation of protocols and training sessions over time.
The common practice for training readers prior to evaluation sessions is first to familiarize them with the scoring rubric. Writing program managers then submit previously scored decisions to the reader-judges for practice scoring. After the judges score the decisions, session leaders provide justification for the accepted score. A give-and-take session usually follows to allow judges and session leaders to discuss their variances.
One challenge that confronts new assessment programs is finding model decisions that reflect representative scores. Most likely, legal writing programs that use holistic assessment systematically will fall into this category.29 Meeting the challenge of finding model decisions is partly art and partly science. These model decisions should come from previous assessments that have had similar prompts; or, as is often the case, they come directly from the current pool of samples during the assessment. A small team of “experts” determines the scores and provides in-depth justification of the strengths and weaknesses of the samples. This information is passed on to prospective judges in training sessions.
Administrative legal organizations have other sources for finding representative decision writing that reflects various scores. Government appeal organizations often provide one or more levels of review of initial appeal decisions. As these decisions filter through the review process, either being upheld or reversed by higher authority, writing program managers can pay attention to both the positive and negative responses reviewers have to the writing. These decisions are often good candidates for training sessions. More complex or controversial decisions that stand the test of time and the administrative review process are usually the sign of mastery level decisions.
As the writing program matures, legal writing professionals will be able to assemble a database of representative writing. Astute writing program managers will ensure that model decisions receive many “looks” and feedback in various forums before they are submitted to a team that will determine their final representative score. Only then will the team provide feedback for the decision by applying the rubric to the content of the decision.
C. Evaluation Session Results
We have found that when inter-rater reliability goes below 85 percent in a holistic assessment, the prior scores of decisions identified by the mismatched pair of readers need to be analyzed.30 For administrative decisions that provoke discrepant scores, introducing additional reads may actually increase variance. The best solution is to refer those decisions to a previously identified small team of calibration experts to resolve differences. Most likely this team will be the readers who identified representative decisions at the beginning of the session. In these instances, the team should attempt to articulate why the decision may have received discrepant scores. Our experience has often been that administrative decisions in this category have an element of complexity, or that a particular aspect of the issue and analysis has compelled some readers to react more harshly than others.
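The 85 percent check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats two readers as agreeing when their scores are identical or adjacent (within one point), which is one common convention and not necessarily the exact statistic behind the figure above. The function names and the sample score pairs are hypothetical and invented for the example.

```python
# Hypothetical paired scores: (first reader, second reader) for each decision.
# These values are invented for illustration; they are not agency data.
paired_scores = [(5, 5), (4, 5), (6, 4), (3, 3), (5, 6),
                 (2, 4), (4, 4), (5, 4), (6, 6), (3, 4)]

RELIABILITY_FLOOR = 0.85  # the 85 percent threshold named in the text


def agreement_rate(pairs, max_gap=1):
    """Share of decisions whose two scores differ by at most max_gap points."""
    agreeing = sum(1 for first, second in pairs if abs(first - second) <= max_gap)
    return agreeing / len(pairs)


def discrepant_decisions(pairs, max_gap=1):
    """Indexes of decisions whose two scores differ by more than max_gap points."""
    return [i for i, (first, second) in enumerate(pairs)
            if abs(first - second) > max_gap]


rate = agreement_rate(paired_scores)
to_expert_team = discrepant_decisions(paired_scores)
needs_review = rate < RELIABILITY_FLOOR

print(f"agreement: {rate:.0%}")
print(f"decisions to refer to the expert team: {to_expert_team}")
print(f"below the 85 percent floor: {needs_review}")
```

In this invented sample, eight of ten pairs agree, so the session falls below the floor and the two discrepant decisions would go to the calibration team instead of receiving additional reads.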
Some relatively straightforward analysis of central tendencies can demonstrate whether the decision writers in an organization are calibrated. For example, assume that the writing program manager in an appeal agency distributes an administrative decision (Decision X) to all the writers in the organization as part of a quarterly quality writing check. From past holistic assessments, training classes, or an expert team analysis, the known score of the decision is five. Sixty-four writers in the organization — writers who are now acting as readers — might typically submit the following assessment scores:
| Score | Number of Times Scored |
| --- | --- |
| Six | 12 |
| Five | 25 |
| Four | 19 |
| Three | 8 |
| Two | 0 |
| One | 0 |
A quick spreadsheet analysis and histogram can show some useful information:
| Results for Decision X | Value |
| --- | --- |
| Mean | 4.6 |
| Mode | 5 |
| Median | 5 |
| Standard deviation | .93 |
| Skew | -.78 |

[Figure: histogram of the scores submitted for Decision X]
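For readers who prefer a script to a spreadsheet, the central tendency figures for Decision X can be recomputed from the score tally with Python's standard `statistics` module. This is a minimal sketch, not the agency's procedure. It assumes the reported standard deviation is the sample standard deviation, and the variable names are illustrative. Skewness is left out because its value depends on which estimator a given spreadsheet applies.

```python
from statistics import mean, median, mode, stdev

# Tally from the Decision X table: score -> number of readers who gave it.
tally = {6: 12, 5: 25, 4: 19, 3: 8, 2: 0, 1: 0}

# Expand the tally into one entry per reader (64 scores in all).
scores = [score for score, count in tally.items() for _ in range(count)]

summary = {
    "readers": len(scores),
    "mean": round(mean(scores), 1),
    "mode": mode(scores),
    "median": median(scores),
    # stdev() is the sample standard deviation (n - 1 in the denominator);
    # treating the reported figure as a sample statistic is an assumption.
    "sample_sd": round(stdev(scores), 2),
}
print(summary)
# {'readers': 64, 'mean': 4.6, 'mode': 5, 'median': 5.0, 'sample_sd': 0.93}
```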
Our experience with these exercises demonstrates the following guideline: when the mean score of the assessment approaches the known score of the decision, and the standard deviation is less than 1.0, the organization approaches appropriate consensus about the quality of the writing. For a large pool of readers, the 1.0 guideline would mean that over 67 percent of the organization has submitted adjacent scores. If the standard deviation is less than 1.0 with other representative samples, combined with other favorable statistics, the program manager can have reasonable assurance that a holistic assessment conducted for hundreds of decisions will produce reliable results. Moreover, ensuring statistical reliability is important so writers can be confident of quality when they see strong decisions and make adjustments to their own writing.
As the histogram and the statistics above show, there are several signs from Decision X that the agency is almost calibrated. First, both the mode and median scores, important measures of central tendency, are equal to the known score of the decision. Second, as the standard deviation reflects, fifty-six readers have scored the decision within one point of the known score. In this instance, the writing program manager can take heart that much of the organization has a consensus about a strongly written decision.
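The guideline can be expressed as a short Python check. This is a sketch under stated assumptions: the text says only that the mean should approach the known score, so the half-point tolerance used here is a hypothetical choice, as are the function and parameter names. The 1.0 limit on the standard deviation and the count of adjacent scores come from the discussion above.

```python
from statistics import mean, stdev


def calibration_check(tally, known_score, mean_tolerance=0.5, sd_limit=1.0):
    """Summarize how closely a pool of readers matches a known score.

    tally maps each score to the number of readers who submitted it.
    mean_tolerance is a hypothetical cutoff for "approaches the known score".
    """
    scores = [score for score, count in tally.items() for _ in range(count)]
    # Readers whose score is identical or adjacent to the known score.
    adjacent = sum(count for score, count in tally.items()
                   if abs(score - known_score) <= 1)
    return {
        "adjacent_readers": adjacent,
        "adjacent_share": adjacent / len(scores),
        "mean_near_known": abs(mean(scores) - known_score) <= mean_tolerance,
        "sd_within_limit": stdev(scores) < sd_limit,
    }


decision_x = {6: 12, 5: 25, 4: 19, 3: 8, 2: 0, 1: 0}
result = calibration_check(decision_x, known_score=5)
print(result)
```

For the Decision X tally, the check counts fifty-six adjacent readers (87.5 percent of the pool) and reports a standard deviation inside the 1.0 limit.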
There are also indications, however, that calibration should continue for another round. First, 12.5 percent of the organization (eight of sixty-four readers) scored Decision X as unsatisfactory. These “three” scores brought the mean score below the known score, but they also show some confusion about what constitutes satisfactory writing. Even though the standard deviation (.93) is within the suggested 1.0 guideline, the measure of skewness suggests that the scores for this decision are skewing negatively. (Generally, measures of skewness between .5 and 1.0 or -.5 and -1.0 show moderate skewness.) Since Decision X was clearly a mastery decision, enough so to warrant twelve readers to submit scores of six, the writing program manager must investigate what elements in the decision affected a group of readers negatively. To investigate these elements, the writing program manager should discuss the scoring with readers or have further training sessions. With repeated calibration sessions that display similar data, organizations can attain satisfactory calibration after four or five decisions.
An “ideally” calibrated organization will submit scores with central tendency statistics that approximate and support the known score. This ideal is never achieved, of course, so there are some additional issues to consider when analyzing scores in calibration exercises. First, the program must take into account the number of readers in a calibration session. For example, in the above session, the writing program manager was calibrating an entire organization of sixty-four writers. With fewer judges, the statistics become more sensitive, because each individual score carries more weight. Conclusions about calibration, therefore, must adjust accordingly. For example, if the exercise were calibrating only ten judges, and two of those judges were continuously submitting discrepant scores, the writing program manager should investigate and resolve those discrepancies more quickly.
The known score of previously scored decisions will also affect data interpretation. Decisions at the “six” level, for example, will certainly skew negatively. (There is no more room on the right side of the scale for the data distribution.) Often in calibration sessions, readers initially have difficulty submitting scores at the extreme ends of the spectrum. In the case of a “six” decision, however, the mean score should be above 5.5 and the writing program manager should certainly analyze all scores that fall below a four.
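The two checks for a “six” decision can be sketched as follows. The 5.5 mean and the cutoff of four come from the paragraph above. The function name and the sample scores are hypothetical and invented for illustration.

```python
from statistics import mean


def review_six_level(scores, mean_floor=5.5, score_floor=4):
    """Check a calibration round for a decision whose known score is six.

    Returns whether the mean clears mean_floor, plus every submitted score
    that falls below score_floor and so deserves a closer look.
    """
    mean_ok = mean(scores) > mean_floor
    flagged = [score for score in scores if score < score_floor]
    return mean_ok, flagged


# Invented scores from ten readers; not agency data.
sample_scores = [6, 6, 6, 6, 6, 6, 5, 6, 3, 6]
mean_ok, flagged = review_six_level(sample_scores)
print(f"mean above 5.5: {mean_ok}, scores to analyze: {flagged}")
```

In this invented round the mean is 5.6, so the pool clears the threshold, but the single score of three would still be analyzed.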
Finally, once reliability has been achieved, an administrative appeals agency over time may monitor the quality of its written products, and, hopefully, demonstrate increased quality in its decision-making. As noted previously, at first our agency had difficulty finding decisions that rated as a six, the highest score. Since holistic assessment has been implemented over the last three years, the mean score of decisions has risen over one point, the number of “six” decisions has increased dramatically, and the number of writers who have received high performance awards based on decisions rated as a six has also markedly increased.
V. Future Issues for Holistic Assessment of Legal Writing
Future research in the use of holistic assessment for legal writing can follow the areas of inquiry already established in writing assessment for other disciplines. Some basic questions emerge: 1) in addition to administrative decisions, can holistic assessment produce the same standards of validity and reliability with other rhetorical forms of legal writing? 2) what writing prompts call upon the appropriate writing skills to substantiate claims about the effects of pedagogy and training for law students and lawyers? 3) in law schools, can holistic assessment be used to allow students to graduate from legal writing programs, to place them in advanced programs, or to judge briefs submitted in moot court exercises? 4) can holistic assessment be used with other forms of writing assessment and with other parts of the law school curriculum to inculcate a consensus about writing quality throughout the curriculum?31
One of the more interesting initiatives in holistic assessment is the automated scoring of essays. Many academic assessments have an essay portion that receives two scores: one from a human reader and one from a computer. Using natural language software programming, computer scoring is able to predict with a very high rate of reliability the scores that human readers would have given an essay. Educational testing companies now offer computer-graded scoring and feedback to student writers who submit essays online.
We have investigated computer-graded scoring for administrative decisions. Our research shows that while it is theoretically possible, some customization and additional modeling must occur for this form of assessment to become effective and reliable. One component of the computer-graded scoring initiative compares to the judgmental dilemma legal readers face when evaluating texts: generally, computer-grading programs evaluate rhetorical markers in texts that can reliably predict a holistic reader’s response; but as we have argued, legal readers, while judging writing, also evaluate content and apparent truth, especially for issues and logic. Computer assessment, therefore, must become more content-driven. Although content-based analysis is a present component of natural language programming, it is not as fully developed as assessment based on rhetorical markers.
It may be that the future of legal writing assessment will be implemented through some form of artificial intelligence. If that is true, however, then artificial intelligence applications will have to don the same components of the judgmental mindset that legal readers have imposed upon texts for thousands of years. If computer assessment software is to find acceptance by the legal community, writers will have to gain the same level of comfort and faith that Socrates did when he asked the judges to decide “justly” about his rhetoric. Socrates submitted his case, his rhetoric, and inevitably his life to an informed community of readers. And in doing so, he acknowledged the validity of the evaluation protocols placed upon his rhetoric and the evaluators’ right to impose judgment upon him.
22 See Peter Elbow, Everyone Can Write: Essays Toward a Hopeful Theory of Writing and Teaching Writing (Oxford U. Press 2000). At the NAD, we borrow heavily from Peter Elbow’s guidelines for integrating the writing process and feedback into our writing program. We allocate time in the writing process so that writers can touch all the important markers between “private writing” and publication. We do not mix a scoring assessment with peer review feedback. Supervisors conduct supervisory peer reviews with the mindset of facilitators who are helping writers prepare final drafts before Internet publication; thus, their feedback comes in the form of a cordial letter, with complete sentences and well-formulated paragraphs, that emphasizes their “reader response” to the final draft of the administrative decision. And their supervisory peer reviews are reviewed quarterly with the same kind of feedback. These are strategies to avert the “war between readers and writers” about which Elbow is concerned.
23 We do formal holistic assessments at national training conferences, for end of the year awards, and for periodic contests. Much of this work is facilitated by electronic transfer, storage, and validation of assessments.
24 Michael M. Williamson, An Introduction to Holistic Scoring: The Social, Historical, and Theoretical Context for Writing Assessment, in Validating Holistic Scoring, supra n. 3, at 9-25.
25 For more information on the reader’s role in the reading act, see Jonathan Culler, The Pursuit of Signs: Semiotics, Literature, Deconstruction 38-43 (Cornell U. Press 1981).
26 Elliot, Plata & Zelhart, supra n. 6, at 15-17. The authors compare the basic principles of the holistic rubric to the cognitive processes of Gestalt psychology.
27 Williamson, supra n. 24, at 29. Williamson provides an excellent discussion that analyzes how distinctions between indirect and direct assessments suffer under the weight of their own defining characteristics. In linguistic theory, it is now common to point out that these disputes themselves are dependent upon the signs that convey them. At an elemental level, they depend upon constructs that are shaken at the outset by the indeterminacy of the sign itself. The reasoning that supports this perspective can be found in Jacques Derrida, Of Grammatology 36 (Gayatri Chakravorty Spivak trans., The Johns Hopkins U. Press 1976). It may be helpful to apply Derrida’s notion of the graph to this discussion. Derrida argues that the graphic form of words is unstable and has an ungraspable point of origin.
28 Roger D. Cherry & Paul R. Meyer, Reliability Issues in Holistic Assessment, in Validating Holistic Scoring, supra n. 3, at 109-38. As part of this thorough discussion of reliability and holistic assessment, Cherry & Meyer go through the advantages and disadvantages of several options for resolving discrepant scores by readers.
29 Initially, in the NAD, it was difficult finding decisions at both ends of the scoring continuum: decisions rated one or six.
30 In high-stakes assessments conducted over the past three years, we have achieved inter-rater reliability exceeding 90 percent.
31 We do not intend for holistic assessment to preclude other forms of indirect assessment or portfolio grading. In fact, some of these other evaluation tools, especially portfolio grading, can complement holistic assessment.
