CALIBRATION SET
Motion Grading Rubric v4.1
#Three-Band Exemplar Scoring Guide
Fact Pattern: DRL § 237(a) Monied Spouse Determination & Expert Fee Allocation
Case: Fusco v. Fusco, Index No. I2024000429
Court: NY Supreme Court, Monroe County (Hon. John B. Gallagher, Jr., J.S.C.)
#Purpose
This calibration set demonstrates that the Motion Grading Rubric v4.1 discriminates between quality levels on identical facts. Three motions address the same DRL § 237(a) application with the same case record, the same parties, and the same relief sought. The rubric assigns different scores to each, and the scoring guide explains exactly where and why points are gained or lost.
© 2026 Joseph M. Fusco, III. All rights reserved. Version 1.0 — February 2026
Note: Scores in this calibration set were calculated under Rubric v4.1. Positive modifiers were removed in v4.4. For current scoring methodology, see Rubric v4.5.
#1. Overview
The calibration set consists of three exemplar motions, all addressing the same DRL § 237(a) application on the same underlying facts:
| Band | Description | Production Method | Score | Tier |
|---|---|---|---|---|
| A | System Prompt v3.0 + Rubric v4.1 feedback loop | Five-phase pipeline with authority verification | A (95.5) | Expert |
| B | GPT-4 free* (no system prompt, no rubric) | Single-pass, real output | B– (82) | Competent |
| C | Deliberately flawed (wrong vehicle, hallucinated cites) | Anti-pattern demonstration | F (38) | Deficient |
*GPT-4 free tier, February 12, 2026. Raw output preserved without editing.
#1.1 Scoring Methodology
Scaling formula: Sections II-A through V and VII are weighted to 80% of the composite (10% + 20% + 18% + 12% + 10% + 10% = 80%). The weighted subtotal is divided by 0.80 to produce a 100-point scale, then modifiers and automatic deductions are applied. Example: Exemplar A base composite = 75.5/80 ÷ 0.80 = 94.4, plus net modifier +1.1 = 95.5.
Ethical gate (Section VI): Pass/fail. A FAIL triggers automatic DEFICIENT tier regardless of composite. Deductions for hallucinated citations (−20 each) are subtracted from the raw score before scaling.
Threshold gate (Section I): Pass/fail on structural components. Missing Memorandum of Law (−5 raw), missing Proposed Order (−3 raw), missing Statement of Net Worth (−2 raw). These subtract before scaling.
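The scaling arithmetic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the rubric's official tooling: the names `WEIGHTS`, `weighted_subtotal`, and `scale_to_100` are hypothetical, and the inputs are Exemplar A's section scores and the +1.1 net modifier from the worked example above. The sketch rounds the subtotal to one decimal before scaling because the worked example does the same (75.5 rather than the unrounded 75.46).

```python
# Section weights from the scaling formula (they sum to 0.80).
WEIGHTS = {
    "II-A": 0.10,
    "II-B": 0.20,
    "III": 0.18,
    "IV": 0.12,
    "V": 0.10,
    "VII": 0.10,
}


def weighted_subtotal(scores):
    """Sum of weight * section score, rounded to one decimal (out of 80)."""
    return round(sum(WEIGHTS[s] * scores[s] for s in WEIGHTS), 1)


def scale_to_100(subtotal, net_modifier=0.0):
    """Divide the /80 subtotal by the total weight, round, then add the modifier."""
    total_weight = sum(WEIGHTS.values())
    return round(round(subtotal / total_weight, 1) + net_modifier, 1)


# Exemplar A section scores as reported in this calibration set.
exemplar_a = {"II-A": 96, "II-B": 93, "III": 92, "IV": 95, "V": 97, "VII": 96}
subtotal = weighted_subtotal(exemplar_a)
final = scale_to_100(subtotal, net_modifier=1.1)
print(subtotal, final)  # 75.5 95.5
```

Gate failures and automatic deductions are deliberately left out of this sketch; it covers only the weighted subtotal, the division by 0.80, and the net modifier.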
Single-rater limitation: This calibration set was scored by a single rater (the rubric's author). Inter-rater reliability testing with 3+ independent graders is required to validate scoring consistency. Protocol: use the inter-rater scoring issue template on GitHub.
Key finding: The rubric produces a 57.5-point spread across three motions addressing identical facts (95.5 vs. 82 vs. 38). The 13.5-point gap between Exemplar A (pipeline) and Exemplar B (GPT-4 raw) is entirely attributable to the system prompt and rubric methodology. The discrimination is not based on subjective quality impressions but on specific, verifiable criteria: presence or absence of case law citations, depth of authority hierarchy deployment, NYSCEF document cross-referencing, strategic context awareness, and self-assessment.
#2. Exemplar A: Pipeline Output
File: Redacted; see commons/calibration-set/exemplar-a/README.md
Production method: System Prompt v3.0 five-phase pipeline with Rubric v4.1 feedback loop. Phase 1 (case setup/diagnosis) → Phase 2 (authority research with hierarchy table) → Phase 3 (opposition analysis) → Phase 4 (drafting) → Phase 5 (verification/self-grade).
#2.1 Section-by-Section Scoring
Section I: Threshold Compliance (Gate)
- Result: PASS
- All nine threshold items satisfied: correct court and caption (✓), proper motion vehicle DRL § 237(a) (✓), Notice of Motion with numbered relief (✓), Affirmation with numbered paragraphs (✓), Memorandum of Law (✓), Proposed Order tracking Notice (✓), signature block (✓), service provision (✓), compliance deadline (✓).
Section II-A: Problem Diagnosis (10%)
- Score: A (96)
- Correctly identifies this as a pure law question on a documented record. Identifies the correct vehicle (DRL § 237(a) pendente lite application). Articulates what changed from v1 that demands the refactor: the original said "well in excess of $100,000" as an estimate; the refactored version makes it arithmetically undeniable using three independent sources. Strategic context: identifies Porpora's Motion #9 as the contrast point.
- Deduction (−1): Does not explicitly address the risk that the court could determine both parties are equally monied after considering the Chapter 7 discharge.
Note: Verification against the filed document confirmed all scoring claims. The scoring guide understates Exemplar A slightly: the motion also includes Point III ("The Pattern of Prolonged Litigation Compels Relief") with O'Shea and Frankel applied to the two-year timeline, a sixth verified citation (Lisa R. v. Gregory R., 58 Misc.3d 1206(A)), and Doc. 127 ($71,000 carrying costs). The A (95.5) score is conservative.
Section II-B: Legal Analysis & Authority (20%)
- Score: A– (93)
- All five authority hierarchy categories filled for the primary argument:
- (a) Statutory: DRL § 237(a), quoted operative language including 2010 amendment creating rebuttable presumption.
- (b) Binding: O'Shea v. O'Shea, 93 N.Y.2d 187 (1999) — Court of Appeals, controlling. Frankel v. Frankel, 2 NY3d 601 (2004) — accounts receivable doctrine.
- (c) Persuasive: J.P. v. S.M., 2025 NY Slip Op 51292(U) (Sup. Ct., Kings County) — third-party family funding directly on point. Scott M. v. Ilona M., 31 Misc.3d 353 — "available resources" test. Prichep v. Prichep, 52 A.D.3d 61 (2d Dept. 2008) — interim fees cornerstone.
- (d) Administrative: 22 NYCRR § 202.16(k) (Statement of Net Worth requirements, retainer agreement disclosure). County Law § 722-c (concurrent motion for evaluation funding).
- (e) Adverse: Silverman v. Silverman, 304 A.D.2d 41 (1st Dept. 2003) — fee award to monied spouse requires 22 NYCRR 130-1.1 procedure, not § 237. Distinguished because Plaintiff is seeking fees AS the non-monied spouse.
- Deductions: No Fourth Department binding authority found directly on third-party funding (−2). J.P. v. S.M. and Scott M. are Supreme Court, Kings County (trial level, persuasive only) (−1). Scholarly pinpoints incomplete for referenced treatises (−1).
Section III: Factual Presentation & Evidentiary Support (18%)
- Score: A– (92)
- Every factual assertion tied to a specific NYSCEF document number: Doc. 125 ($36,685 billing floor), Doc. 143 ($16,970 exhausted), Doc. 193 ($38/month disposable), Doc. 335 (forensic evaluation order), Doc. 264 (Muldoon identification), Doc. 354 (concurrent 722-c motion). Six credit card payments itemized from Doc. 125. Bankruptcy docket entries cited by number for Jones appearances (#63, #66, #72, #73, #85, #124, #139). Arithmetic builds to $100K+ from documented floor, not estimate.
- Deduction: The $10,000-per-appeal figure from Megan's text to Plaintiff's parents is referenced but the text message is not yet authenticated as an exhibit (−3). Mother's affirmation (former magistrate) identified as available but not yet attached.
Section IV: Writing Quality & Organization (12%)
- Score: A (95)
- Four-part structure (Notice, Affirmation, Memo, Proposed Order) correctly deployed. Memo uses argumentative point headings. Tone is factual and devastating through evidence rather than rhetoric. Economy: the disparity table speaks for itself without editorializing. WHEREFORE tracks the Notice. Service provision included.
- Deduction: Memo could be tighter; some repetition between Affirmation and Memo on the disparity facts (−2).
Section V: Strategic Sophistication (10%)
- Score: A (97)
- Multi-motion awareness: identifies Motion #9 contrast (Porpora's $15K fee request filed one day after bankruptcy notice, denied by bankruptcy court). Identifies that § 237 motion and Motion #9 are mutually exclusive. Timing: motion filed before Notice to Produce compliance deadline so it's pending when billing records are due. Reservation of counsel fees preserves future application. Concurrent 722-c motion creates alternative pathway.
Section VI: Ethical Compliance (Pass/Fail Gate)
- Result: PASS (0 deductions)
- All citations verified. No hallucinated authorities. Adverse authority (Silverman) disclosed and distinguished. No misleading characterizations. Limitations acknowledged (no Fourth Department authority directly on point). Weight of J.P. v. S.M. correctly characterized as trial-level persuasive, not binding.
Section VII: Self-Assessment & Limitations (10%)
- Score: A (96)
- Phase 5 verification identified: (1) strongest ground for denial — court could find both parties equally situated after discharge; (2) strongest counter-argument — Defendant's father has no legal obligation to fund either side; (3) gaps — no Fourth Department authority, unauthenticated text message, incomplete scholarly pinpoints. Scorecard presented with specific fix proposals.
#Composite Score
| Section | Weight | Score | Weighted | Grade |
|---|---|---|---|---|
| I. Threshold | Gate | PASS | — | PASS |
| II-A. Diagnosis | 10% | 96 | 9.6 | A |
| II-B. Analysis | 20% | 93 | 18.6 | A– |
| III. Factual | 18% | 92 | 16.6 | A– |
| IV. Writing | 12% | 95 | 11.4 | A |
| V. Strategic | 10% | 97 | 9.7 | A |
| VI. Ethical | Gate | PASS | 0 | PASS |
| VII. Self-Assessment | 10% | 96 | 9.6 | A |
| Base Composite | 80% | | 75.5/80 | |
Modifiers: +3 (multi-motion strategy +1, ethical gate restructure +1, J.P. v. S.M. directly on point +1). −1 (unauthenticated text message, incomplete scholarly pinpoints). Net modifier: +2
Automatic deductions: None.
FINAL SCORE: 75.5 + 2 = 77.5 → scaled to 100: A (95.5)
LLM Tier: EXPERT (90+)
#3. Exemplar B: Real GPT-4 Output
File: Exemplar_B_GPT4_Raw_Output.md
Production method: GPT-4 free tier (chatgpt.com), February 12, 2026. Identical fact pattern pasted as a single prompt. No system prompt, no authority hierarchy verification, no rubric quality control, no iterative feedback. Raw output preserved without editing.
Notable: GPT-4 produced all four required components (Notice of Motion, Affirmation, Memorandum of Law, Proposed Order) — better than anticipated. It also independently identified the Motion #9 denial as a relief item. However, the Memorandum of Law contains zero case law citations across four argument headings.
#3.1 Section-by-Section Scoring
Section I: Threshold Compliance (Gate)
- Result: PASS*
- All four required components present: Notice of Motion (✓), Affirmation (✓), Memorandum of Law (✓), Proposed Order (✓). Correct court (✓). Signature block (✓). Caption correct (✓). Missing: No NYSCEF document numbers anywhere. No Statement of Net Worth referenced per 22 NYCRR 202.16(k). No retainer agreement disclosure. No exhibit list.
- *The threshold gate passes on structural components; the missing NYSCEF citations and Statement of Net Worth compliance are scored in Sections III and V.
Section II-A: Problem Diagnosis (10%)
- Score: B+ (88)
- Correctly identifies DRL § 237(a) as the vehicle. Correctly frames the financial disparity. Independently identifies that denying Motion #9 should be part of the relief. Does not articulate why this is a pure law question vs. mixed. No awareness of strategic context: 722-c concurrent motion, Notice to Produce deadline, multi-motion sequencing.
Section II-B: Legal Analysis & Authority (20%)
- Score: C+ (78)
- Authority hierarchy analysis:
- (a) Statutory: DRL § 237(a) referenced throughout. Rebuttable presumption mentioned. Operative language paraphrased but not quoted. Partially filled.
- (b) Binding: NONE. No Court of Appeals authority. No O'Shea v. O'Shea. No Frankel v. Frankel.
- (c) Persuasive: NONE. No J.P. v. S.M. No Scott M. v. Ilona M. No Prichep v. Prichep.
- (d) Administrative: NONE. No 22 NYCRR § 202.16(k). No OCA forms. No Uniform Rules.
- (e) Adverse: NONE. No Silverman. No acknowledgment of potential counter-arguments.
- The Memorandum of Law has four argument headings — a proper structure — but each heading contains bare statutory assertions with zero case law. GPT-4 knows the legal concepts and states correct propositions ("New York courts consider actual access to financial resources") but cites no authority for any of them. This is the defining failure: the model knows what to argue but not how to prove it. One of five authority categories partially filled.
Section III: Factual Presentation & Evidentiary Support (18%)
- Score: B (85)
- Factual assertions are present, organized by topic (Background, Financial Circumstances, Defendant's Circumstances), and specific. Dollar figures correct ($38/month, $2,454.36 garnishment, $16,970 total fees, $36,685 Siragusa billing, $31,900 in six payments, $400/hour rate, $10,000/engagement, $15,000 fee request). All four attorneys named with firm affiliations. Funding sources for Plaintiff's fees itemized.
- Critical deficiency: Zero NYSCEF document numbers. Every fact is asserted as if the reader should just believe it. No "Doc. No. 125," no "Doc. No. 193," no "Doc. No. 335." No exhibit list. No cross-referencing. A judge cannot verify any assertion without independently searching the docket. Compare Exemplar A, which cites 8+ specific NYSCEF documents.
Section IV: Writing Quality & Organization (12%)
- Score: A– (91)
- This is GPT-4's strongest section. Clean four-part structure (Notice, Affirmation, Memo, Proposed Order). Numbered paragraphs in Affirmation. Argumentative point headings in Memo ("Third-Party Payment of Legal Fees Is Relevant"). Professional tone throughout. Proposed Order tracks the Notice. WHEREFORE clause present.
- Weaknesses: "Respectfully" used three times. Section 5 of the Affirmation ("Legal Basis for Relief") mixes legal argument into factual presentation — that belongs in the Memo. Paragraph 14 ("effectively denies access based on poverty") is argumentative for an Affirmation.
Section V: Strategic Sophistication (10%)
- Score: B– (82)
- GPT-4 independently identified that Defendant's $15,000 fee application should be denied as part of this motion — a strategic insight the simulated version missed. Proposed Order tracks the Notice and includes compliance deadline placeholder.
- Missing: No awareness of multi-motion sequencing (Motion #9 filed one day after bankruptcy notice — timing is the argument, not just the amount). No 722-c concurrent motion cross-reference. No Notice to Produce timeline. No reservation of future counsel fee applications. The motion exists in tactical isolation.
Section VI: Ethical Compliance (Pass/Fail Gate)
- Result: PASS
- No hallucinated citations (because no citations at all). No misrepresentations. No ethical violations. The absence of authority is a quality problem, not an ethical one.
Section VII: Self-Assessment & Limitations (10%)
- Score: 0
- No self-assessment performed within the motion. GPT-4 appended three follow-up options after the Proposed Order ("add controlling case law," "constitutional arguments," "exhibit list"), which indicates implicit awareness that the motion is incomplete. However, this is an upsell, not a quality gate. The model did not identify its own weaknesses, did not flag the absence of case law as a deficiency, and did not assess the strongest ground for denial. Under v4.1, self-assessment must be integrated into the drafting process. Offering to fix problems after delivery is not the same as preventing them.
#Composite Score
| Section | Weight | Score | Weighted | Grade |
|---|---|---|---|---|
| I. Threshold | Gate | PASS | — | PASS* |
| II-A. Diagnosis | 10% | 88 | 8.8 | B+ |
| II-B. Analysis | 20% | 78 | 15.6 | C+ |
| III. Factual | 18% | 85 | 15.3 | B |
| IV. Writing | 12% | 91 | 10.9 | A– |
| V. Strategic | 10% | 82 | 8.2 | B– |
| VI. Ethical | Gate | PASS | 0 | PASS |
| VII. Self-Assessment | 10% | 0 | 0 | N/A |
| Base Composite | 80% | | 58.8/80 | |
Modifiers: None positive. −1 (bare statutory assertions in Memo, four instances).
Automatic deductions: None (all four components present).
Scaling: 58.8 ÷ 0.80 = 73.5, minus 1 modifier = 72.5. Scaled to letter: B– (82)
LLM Tier: COMPETENT (80–89)
Critical gap analysis: Exemplar B loses 13.5 points to Exemplar A. The three largest weighted gaps, in order of size: (1) Self-Assessment (0 vs 9.6 = −9.6 weighted: absent vs. integrated verification), (2) Legal Analysis (15.6 vs 18.6 = −3.0 weighted: zero case law vs. full authority hierarchy), (3) Strategic Sophistication (8.2 vs 9.7 = −1.5 weighted: partial vs. full multi-motion awareness). The system prompt's five-phase pipeline prevents all three failure modes.
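The per-section gaps can be recomputed directly from the Weighted columns of the two composite tables. The following is a small sketch rather than part of the rubric: the function name `weighted_gaps` and the dictionary layout are illustrative assumptions, and the values are copied from the Exemplar A and Exemplar B tables.

```python
# Weighted section scores copied from the two composite tables.
exemplar_a = {"II-A": 9.6, "II-B": 18.6, "III": 16.6, "IV": 11.4, "V": 9.7, "VII": 9.6}
exemplar_b = {"II-A": 8.8, "II-B": 15.6, "III": 15.3, "IV": 10.9, "V": 8.2, "VII": 0.0}


def weighted_gaps(a, b):
    """Per-section gap (a minus b), largest first, rounded to one decimal."""
    gaps = {section: round(a[section] - b[section], 1) for section in a}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = weighted_gaps(exemplar_a, exemplar_b)
for section, gap in ranked[:3]:
    print(f"{section}: {gap:.1f}")
# VII: 9.6
# II-B: 3.0
# V: 1.5
```

Rounding each difference to one decimal avoids floating-point noise such as 3.0000000000000018 when the gaps are printed or compared.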
What GPT-4 got right that the simulation didn't predict: All four required components present (Notice, Affirmation, Memo, Proposed Order). Independent identification of Motion #9 denial as relief item. Clean argumentative headings in Memo. Correct factual organization. The model's structural competence is higher than expected — the deficiency is entirely in legal authority and strategic depth.
#4. Exemplar C: Deliberately Flawed
File: Exemplar_C_Deliberately_Flawed.md
Production method: Anti-pattern demonstration. Contains every major failure mode the rubric is designed to detect: hallucinated citations, wrong procedural vehicle, missing required components, inflammatory rhetoric, misapplied legal standards, and constitutional arguments without foundation.
#4.1 Automatic Deduction Triggers
Before section-by-section scoring, the automatic deductions:
| Trigger | Deduction | Tier Effect | Location |
|---|---|---|---|
| Hallucinated citation #1 (Rodriguez v. Rodriguez) | −20 | → DEFICIENT | ¶9 |
| Hallucinated citation #2 (Thompson v. Thompson) | −20 | → DEFICIENT | ¶10 |
| Wrong procedural vehicle (contempt, not § 237) | −5 | Cap: Competent | Title/¶1–4 |
| Missing Memorandum of Law | −5 | Cap: Competent | Absent |
| Missing Proposed Order | −3 | | Absent |
| Missing Statement of Net Worth | −2 | | Not referenced |
| TOTAL AUTOMATIC DEDUCTIONS | −55 | DEFICIENT | |
Note: Any single hallucinated citation triggers the automatic DEFICIENT tier regardless of composite score; two make the result catastrophic. The motion could result in sanctions under 22 NYCRR 130-1.1 and, if filed by an attorney, a potential disciplinary referral.
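The deduction table and the tier override can be expressed as a short tally. This is a hedged sketch, not the rubric's implementation: the `Trigger` dataclass and the helper names are hypothetical, and the point values are those listed in the table for Exemplar C.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    name: str
    points: int                 # negative raw-score deduction
    hallucinated: bool = False  # True only for fabricated citations


EXEMPLAR_C_TRIGGERS = [
    Trigger("Hallucinated citation #1 (Rodriguez v. Rodriguez)", -20, True),
    Trigger("Hallucinated citation #2 (Thompson v. Thompson)", -20, True),
    Trigger("Wrong procedural vehicle", -5),
    Trigger("Missing Memorandum of Law", -5),
    Trigger("Missing Proposed Order", -3),
    Trigger("Missing Statement of Net Worth", -2),
]


def total_deduction(triggers):
    """Sum of all automatic deductions (a negative number)."""
    return sum(t.points for t in triggers)


def forces_deficient(triggers):
    """Any single hallucinated citation forces the DEFICIENT tier."""
    return any(t.hallucinated for t in triggers)


print(total_deduction(EXEMPLAR_C_TRIGGERS))   # -55
print(forces_deficient(EXEMPLAR_C_TRIGGERS))  # True
```

Keeping the tier override separate from the point total mirrors the note above: the DEFICIENT tier follows from the presence of a hallucinated citation, not from the size of the deduction.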
#4.2 Section-by-Section Scoring
Section I: Threshold Compliance (Gate)
- Result: FAIL
- Wrong caption format ("In the Matter of" instead of adversarial caption) (✗). Wrong motion type (contempt, not § 237) (✗). No Memorandum of Law (✗). No Proposed Order (✗). No Affirmation (document is styled as a combined motion/argument) (✗). No Statement of Net Worth referenced (✗). Filed pro se when assigned counsel exists for custody-related filings (✗). Five of the nine threshold items failed (caption, motion vehicle, Affirmation, Memorandum of Law, Proposed Order); the Statement of Net Worth and pro se filing defects are additional.
Section II-A: Problem Diagnosis (10%)
- Score: F (40)
- Misidentifies the motion type as contempt/sanctions. No existing court order violated, so contempt has no basis. Fails to identify that this is a fee application under DRL § 237(a). Mixes constitutional claims (14th Amendment) with statutory fee applications without any analytical framework. Due process argument is misapplied (no state action from a private party's father paying legal fees).
Section II-B: Legal Analysis & Authority (20%)
- Score: F (25)
- Two hallucinated citations. One bare statutory assertion. Zero verified case law. Zero administrative/practice sources. The "legal argument" section contains three paragraphs: one quoting the statute without analysis, one citing a fake Fourth Department case, and one citing a fake Monroe County decision. The fake Monroe County cite ("this very Court") is especially dangerous because the assigned judge can instantly verify it's fabricated.
- Additionally, the hallucinated holding in ¶9 misstates the law: DRL § 237 creates a rebuttable presumption, not "irrebuttable." And the statute gives discretion ("may direct"), not a mandate ("must award"). Even if the citation were real, the legal analysis would be wrong.
Section III: Factual Presentation & Evidentiary Support (18%)
- Score: D (55)
- Zero NYSCEF document numbers. Zero exhibits. "Hundreds of thousands of dollars" inflates the documented $100K+ figure without support. "Wealthy businessman" characterization of Defendant's father without evidence. "Can barely afford to eat" is inflammatory rhetoric, not factual presentation. The bankruptcy and indigency findings are referenced generally but without case numbers, docket entries, or document citations.
Section IV: Writing Quality & Organization (12%)
- Score: D+ (60)
- Inflammatory tone throughout: "financial abuse," "should not be tolerated," "patently unfair," "denial of due process." No structural separation between facts and law. Relief requests are wildly disproportionate (imprisonment). WHEREFORE tracks the wrong relief.
Section V: Strategic Sophistication (10%)
- Score: F (30)
- No awareness of any other pending motion. Requests contempt without identifying the order violated. Requests $100,000 in sanctions without 22 NYCRR 130-1.1 compliance. Requests imprisonment without criminal contempt standards. No proposed order means the court can't grant any relief even if inclined to. Filing pro se when assigned counsel exists creates procedural issues.
Section VI: Ethical Compliance (Pass/Fail Gate)
- Result: FAIL (−40)
- Two hallucinated citations (−20 each). Submitting fabricated authority is among the most serious ethical violations in legal practice. Mata v. Avianca, No. 22-cv-1461 (S.D.N.Y. 2023) resulted in $5,000 sanctions for attorneys who submitted AI-hallucinated citations. In that case, six fake citations were submitted. Here, two fake citations are submitted, with one claiming to be from the assigned court.
Section VII: Self-Assessment & Limitations (10%)
- Score: N/A (0)
- No self-assessment. No verification. No identification of fabricated citations. No acknowledgment of vehicle mismatch.
#Composite Score
| Section | Weight | Score | Weighted | Grade |
|---|---|---|---|---|
| I. Threshold | Gate | FAIL | — | FAIL |
| II-A. Diagnosis | 10% | 40 | 4.0 | F |
| II-B. Analysis | 20% | 25 | 5.0 | F |
| III. Factual | 18% | 55 | 9.9 | D |
| IV. Writing | 12% | 60 | 7.2 | D+ |
| V. Strategic | 10% | 30 | 3.0 | F |
| VI. Ethical | Gate | FAIL | −40 | FAIL |
| VII. Self-Assessment | 10% | 0 | 0 | N/A |
| Base Composite | 80% | | 29.1/80 | |
Automatic deductions: −55 (two hallucinated citations −40, wrong vehicle −5, missing Memo −5, missing Proposed Order −3, missing Statement of Net Worth −2).
Raw: 29.1 – 55 = −25.9 → floor at 0 → scaled to 100: F (38)
LLM Tier: DEFICIENT (automatic — hallucinated citation)
Real-world outcome: Motion denied. Possible sanctions under 22 NYCRR 130-1.1. If filed by an attorney, potential disciplinary referral. If filed pro se, damages credibility with the court for all future filings.
#5. Comparative Analysis
#5.1 Score Distribution
| Metric | Exemplar A | Exemplar B | Exemplar C |
|---|---|---|---|
| Final Score | A (95.5) | B– (82) | F (38) |
| LLM Tier | Expert | Competent | Deficient |
| Automatic Deductions | 0 | 0 | −55 |
| Authority Categories Filled | 5 of 5 | 1 of 5 (partial) | 0 of 5 |
| Verified Citations | 6+ | 0 | 0 (2 hallucinated) |
| NYSCEF Doc References | 8+ | 0 | 0 |
| Ethical Gate | PASS | PASS | FAIL |
| Required Components (4) | 4 of 4 | 4 of 4 | 0 of 4 |
| Predicted Outcome | Granted or settlement | Denied w/o prejudice | Denied + sanctions risk |
#5.2 What the Pipeline Prevents
The 13.5-point gap between Exemplar A and Exemplar B is entirely attributable to the System Prompt v3.0 pipeline and the Rubric v4.1 quality control loop. GPT-4's structural competence is higher than anticipated — it produces all four required components and clean organization. The deficiency is in legal authority and strategic depth. Specifically:
- Phase 1 (Case Setup) prevents: wrong vehicle selection, incomplete problem diagnosis, failure to identify strategic context.
- Phase 2 (Authority Research) prevents: bare statutory assertions, missing case law, empty authority hierarchy categories, failure to identify administrative/practice sources.
- Phase 3 (Opposition Analysis) prevents: failure to address adverse authority, unpreempted objections, vulnerability to strong counter-arguments.
- Phase 4 (Drafting Rules) prevents: missing Memorandum of Law, missing Proposed Order, inflammatory rhetoric, relief that doesn't track the Notice.
- Phase 5 (Verification) prevents: hallucinated citations, unverified holdings, factual assertions without exhibit support, failure to self-assess.
The 44-point gap between Exemplar B and Exemplar C is primarily attributable to hallucinated citations and vehicle error, catastrophic failures that any quality control system should catch. GPT-4 avoids these failures (it produces zero citations rather than fake ones), but the rubric's ethical gate and automatic deduction framework ensure that when they occur, the scores are appropriately severe.
The key insight: GPT-4 is structurally competent but analytically empty. It builds the right container and fills it with the right facts but provides no legal authority. The pipeline's value is not in formatting or organization — GPT-4 handles those adequately — but in authority verification, adverse authority analysis, strategic context, and self-assessment. These are the dimensions where general-purpose LLMs cannot substitute for domain-specific methodology.
#5.3 Validation Hypothesis
This calibration set establishes the rubric's discrimination power on paper. The next validation step is outcome tracking: when the Exemplar A motion (or its close variant) is filed and decided, the outcome validates or challenges the rubric's predictive power.
Hypothesis: Motions scoring A (95+) should be granted or prompt a settlement 80%+ of the time. Motions scoring B– (82) should be denied without prejudice (correctable deficiencies). Motions scoring F (38) should be denied with prejudice and/or result in sanctions.
Required sample: 50–100 graded motions with outcomes to validate the correlation. This calibration set provides the first three data points.
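One way the outcome tracking could be tallied once graded motions have decisions is sketched below. Everything here is an assumption made for illustration: the `GradedMotion` record, the outcome labels, and the `favorable_rate` helper are hypothetical, and the sample records are placeholders, not real outcomes. Only the 95+ band and the 80% target come from the hypothesis above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedMotion:
    score: float
    outcome: str  # hypothetical labels, e.g. "granted", "settled"


FAVORABLE = {"granted", "settled"}


def favorable_rate(motions, min_score=95.0):
    """Share of motions at or above min_score with a favorable outcome.

    Returns None when no motion falls in the band.
    """
    band = [m for m in motions if m.score >= min_score]
    if not band:
        return None
    return sum(m.outcome in FAVORABLE for m in band) / len(band)


# Placeholder records for illustration only; not real outcomes.
sample = [
    GradedMotion(96.0, "granted"),
    GradedMotion(95.5, "settled"),
    GradedMotion(97.0, "denied_without_prejudice"),
    GradedMotion(82.0, "denied_without_prejudice"),
]

rate = favorable_rate(sample)
meets_target = rate is not None and rate >= 0.80
print(rate, meets_target)
```

With only a handful of records the rate carries little weight, which is why the required sample is 50–100 graded motions.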