TECH RELEASES

Capstone Development Journal  |  Month 4: Final Benchmark and Defense Preparation

Three Runs, Three Days, One Finding: How Iterating on the Benchmark Changed Everything (Except the Answer)

Cameron Ashby  |  March 29, 2026

Series: Capstone Development Journal  |  Entry 3 of 4

The Goal

This is the last month. Defense is in April. The goal for Month 4 was to finalize everything: achieve complete 50/50/50 benchmark coverage across all three models, write the IEEE paper (Milestone 10), run performance profiling, and prepare for the defense presentation with Dr. Marpaung. Four months ago, I was staring at a Flask backend with demo-mode predictions and a hypothesis I had not yet tested. Now I have real data, real statistics, and a finding I did not expect.

Features Developed: The Three-Run Benchmark

Why Three Runs?

The honest answer is that nothing worked perfectly the first time. The first benchmark run on March 22 was the first time all three models ran against all 50 CASP14 targets simultaneously. It exposed every gap in the pipeline. The second run on March 24 fixed the AlphaFold 2 coverage problem. The third run on March 25 pushed ESMFold and OmegaFold to near-complete coverage. Each iteration taught me something the previous one missed.

Benchmark Coverage Progression

Run      | Date   | ESMFold | OmegaFold | AF2   | What Changed
3ecb2a3d | Mar 22 | 38/50   | 38/50     | 33/50 | First full run
3ca6af2a | Mar 24 | 38/50   | 38/50     | 48/50 | ColabFold fixed
393aced9 | Mar 25 | 50/50   | 50/50     | 50/50 | Near-complete

Run 1: March 22 (3ecb2a3d)

The first run established the core finding. ESMFold and OmegaFold each completed 38 out of 50 targets. AlphaFold 2 hit only 33 because ColabFold was still fighting the JAX/Haiku dependency conflict I documented last month. The paired t-test on TM-score between ESMFold and OmegaFold yielded p = 0.551. Not significant. The two models are statistically indistinguishable in structural accuracy.

I ran the numbers three times because I did not believe them. ESMFold, which runs in 0.45 seconds through Meta’s API with no GPU required, produces the same quality as OmegaFold, which takes 62 seconds on my RTX 4070 and requires a dedicated conda environment to avoid crashing the rest of the backend. Same accuracy. A roughly 138x speed difference.
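The paired t-test above can be sketched with SciPy. Note the TM-score lists below are illustrative placeholders, not the actual CASP14 benchmark values; the test pairs each target's two scores and asks whether the per-target differences center on zero.

```python
# Hedged sketch of the paired t-test on per-target TM-scores.
# The score lists are illustrative placeholders, not the real benchmark data.
from scipy import stats

# One TM-score per target, aligned so index i is the same protein in both lists.
esmfold_tm   = [0.31, 0.42, 0.28, 0.55, 0.19, 0.47, 0.33, 0.26]
omegafold_tm = [0.35, 0.40, 0.30, 0.58, 0.21, 0.44, 0.37, 0.29]

# Paired (dependent) t-test: compares the per-target differences against zero.
t_stat, p_value = stats.ttest_rel(esmfold_tm, omegafold_tm)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A paired test is the right choice here because both models predict the same 50 targets, so per-protein difficulty cancels out of the comparison.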

Run 2: March 24 (3ca6af2a)

The second run focused on fixing AlphaFold 2 coverage. The JAX/Haiku conflict was resolved by giving ColabFold a completely separate conda environment (colabfold_env), and the MMseqs2 rate limit was handled with a single-sequence fallback. AF2 coverage jumped from 33 to 48 targets. The ESMFold and OmegaFold data were identical to Run 1 because neither model’s pipeline changed between runs, so the TM-score p-value remained 0.551.
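The single-sequence fallback can be sketched as a thin wrapper around the MSA search. The names `fetch_msa` and `RateLimitError` below are hypothetical stand-ins for the real ColabFold/MMseqs2 client, used only to illustrate the control flow.

```python
# Hypothetical sketch of the MMseqs2 rate-limit fallback. fetch_msa and
# RateLimitError are placeholder names, not real ColabFold APIs.

class RateLimitError(Exception):
    """Raised when the remote MSA server rejects a request."""

def fetch_msa(sequence: str) -> list[str]:
    # Stand-in for the remote MMseqs2 search; here it always rate-limits
    # so the fallback path below is exercised.
    raise RateLimitError("MMseqs2 server busy")

def msa_or_single_sequence(sequence: str) -> list[str]:
    """Try the remote MSA search; fall back to single-sequence mode."""
    try:
        return fetch_msa(sequence)
    except RateLimitError:
        # AF2 can still run from a single sequence, at reduced accuracy.
        return [sequence]

alignment = msa_or_single_sequence("MKTAYIAKQR")
```

The trade-off is explicit: a degraded-but-complete prediction beats a missing row in the benchmark CSV.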

Run 3: March 25 (393aced9)

The final run pushed coverage to 50/50/50. Eleven additional ESMFold targets that had failed in earlier runs were completed (API timeouts resolved by retry logic). Twelve additional OmegaFold targets came through. The one missing ESMFold target was completed manually from the command line with a pLDDT of 45.46 and a prediction time of 8.9 seconds, confirming it was a genuinely difficult target with low model confidence.
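The retry logic for the API timeouts can be sketched as a small backoff wrapper; `predict_structure` below is a placeholder for the real HTTP call to Meta's API, written to fail twice so the retry path is visible.

```python
# Hedged sketch of retry-with-backoff for API timeouts.
# predict_structure is a placeholder, not the real ESMFold client.
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on TimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the timeout to the caller
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def predict_structure():
    # Placeholder: times out twice, then succeeds on the third attempt.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("API timeout")
    return "PDB_TEXT"

result = with_retries(predict_structure, base_delay=0.01)
```

Capping the attempts matters: a target that fails all retries should be logged and skipped, not allowed to stall the whole 50-target run.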

With the larger sample (45 paired TM-scores vs 36), the p-value shifted from 0.551 to 0.256. Still not significant. Still equivalent. But now with better statistical power behind the claim.

The Finding That Changed

Here is the part I did not expect. In Runs 1 and 2, OmegaFold produced significantly higher pLDDT confidence scores than ESMFold (p = 0.004). OmegaFold was more confident in its predictions. I noted this in my earlier analysis, and it seemed like a meaningful difference.

In Run 3, with 11 more ESMFold targets and 12 more OmegaFold targets, the pLDDT difference disappeared. p = 0.171. Not significant. The additional targets closed the gap. The earlier significance was an artifact of which proteins happened to complete first, not a real model difference.

This is exactly why you iterate. If I had stopped at Run 1, I would have reported that OmegaFold has significantly higher prediction confidence. That claim would have been technically true for that dataset and completely misleading as a general conclusion. The full dataset tells a different story: ESMFold and OmegaFold are equivalent on both accuracy and confidence.
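The flip is easy to reproduce on toy numbers: a one-sample t-test on the early paired differences looks wildly significant, but adding later targets whose differences lean the other way washes it out. The difference values below are fabricated purely for illustration.

```python
# Toy demonstration of a significance flip as coverage grows.
# The difference values are fabricated for illustration only.
from scipy import stats

# Per-target pLDDT differences (OmegaFold minus ESMFold), early subset.
early_diffs = [1.1, 0.9, 1.0, 1.2, 0.8]
# Later-completing targets happen to lean the other way.
late_diffs = [-1.1, -0.9, -1.0, -1.2, -0.8]

# Test whether the mean difference is zero, on the subset and the full set.
_, p_early = stats.ttest_1samp(early_diffs, 0.0)
_, p_full = stats.ttest_1samp(early_diffs + late_diffs, 0.0)
print(f"early subset p = {p_early:.4f}, full set p = {p_full:.4f}")
```

The early subset is not a random sample here, which is exactly the trap: whichever proteins completed first determined which "finding" the partial data supported.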

ESMFold vs OmegaFold: Final Results (393aced9)

Metric         | ESMFold | OmegaFold | p-value
Mean TM-Score  | 0.318   | 0.358     | 0.256 (n.s.)
Mean pLDDT     | 73.94   | 75.87     | 0.171 (n.s.)
Median Time    | 0.45 s  | 58.11 s   | < 0.0001*
Three-Way Wins | 19 / 45 | 19 / 45   | Tied

The New Finding

Run 3 also revealed something invisible in the smaller datasets. One-way ANOVA detected a significant fold-class effect for ESMFold (F = 3.42, p < 0.05). This means protein structural classification (all-alpha, all-beta, alpha-beta) does predict ESMFold’s accuracy. OmegaFold and AlphaFold 2 show no such effect.
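The fold-class test is a standard one-way ANOVA on TM-scores grouped by structural class. The groups below are placeholder values, not the real benchmark numbers; they just show the shape of the computation.

```python
# Hedged sketch of the one-way ANOVA across fold classes.
# The TM-score groups are illustrative placeholders, not benchmark data.
from scipy import stats

# ESMFold TM-scores grouped by protein structural class.
all_alpha  = [0.45, 0.52, 0.38, 0.49]
all_beta   = [0.22, 0.18, 0.27, 0.25]
alpha_beta = [0.33, 0.41, 0.29, 0.36]

# One-way ANOVA: is between-class variance large relative to within-class?
f_stat, p_value = stats.f_oneway(all_alpha, all_beta, alpha_beta)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A significant F only says the class means differ somewhere; pinning down which classes differ would take a post-hoc test such as Tukey's HSD.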

My original hypothesis was that protein characteristics would predict which model performs better. The TM-score equivalence finding largely disproved this. But the ANOVA result partially resurrects it, specifically for ESMFold. ESMFold’s lighter architecture (ESM-2 3B, no recycling, single forward pass) appears more sensitive to protein topology than OmegaFold’s deeper network (ESM-2 15B, 16 OmegaPlex blocks, 2 recycles). This is the kind of nuanced finding that only emerges with near-complete coverage.

Milestone 10: The IEEE Paper

The IEEE paper is done. It covers the full journey from Milestone 1 through the final benchmark, including every road bump: the ESMFold pivot from local GPU to Meta’s API, the OmegaFold silent failure from the Anaconda environment mismatch, the NumPy version conflict, the CrAss phage sequence assignment bug, the JAX/Haiku dependency conflict for ColabFold, and the MMseqs2 rate limit. All three benchmark runs are referenced to show how the findings evolved as data improved.

Dr. Marpaung positioned the paper as being “one step ahead” of the primary publications. Jumper et al. presented AlphaFold 2 in isolation. Lin et al. and Wu et al. each presented their models separately. Mirdita et al. made AlphaFold 2 more accessible through ColabFold but did not integrate competing architectures. FastFold Suite is the first platform to integrate all three into a single interface, with a unified statistical comparison framework, built and benchmarked on consumer hardware.

Retrospective

What Went Right This Month

Everything converged. The three-run benchmark progression took coverage from 38/38/33 to 50/50/50 in four days. The IEEE paper distilled 16 weeks of work into a coherent 10-page document with real data, charts, and findings. The verification report confirmed 155 passing unit tests and 134 independent statistical checks. Dr. Marpaung’s weekly meetings kept the defense preparation on track, and his offer to hold Saturday prep sessions showed how invested he is in the outcome.

What Went Wrong This Month

The pLDDT significance flip was a lesson in drawing conclusions from incomplete data. If I had submitted the IEEE paper after Run 1, I would have reported a significant pLDDT difference that does not exist in the full dataset. I also underestimated how long the ColabFold integration would take: the JAX/Haiku conflict and the MMseqs2 rate limit each cost a full day of debugging. And the silent OmegaFold failure caused by the Anaconda environment mismatch was the most frustrating bug of the entire project because it produced no error message. The platform appeared to work fine. Only the empty columns in the CSV told me something was broken.

How Can I Improve Moving Forward

The biggest takeaway is to distrust early results until coverage is complete. Statistical significance on a subset is not the same as significance on the full population. Going forward, whether at Purdue or in industry, I will always run the full dataset before reporting findings, even if the preliminary data looks clean.

The second takeaway is about environment management. Three separate conda environments (base, omegafold, colabfold_env) running on the same machine with conflicting dependencies is a maintenance nightmare. Containerization with Docker would have prevented every environment-related bug in this project. That is the first infrastructure improvement I will make if I continue developing FastFold Suite.
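A containerized setup might look like the sketch below, with one image per model so conflicting dependency stacks never share an interpreter. The file names, base image, and entry point are assumptions for illustration, not the project's actual configuration.

```dockerfile
# Hypothetical sketch: one image per model environment, so conflicting
# stacks (e.g. ColabFold's JAX/Haiku vs OmegaFold's PyTorch) never share
# a site-packages directory.
FROM python:3.10-slim

WORKDIR /app

# A pinned per-model requirements file would replace each conda env;
# the file and script names here are assumptions for illustration.
COPY requirements-omegafold.txt .
RUN pip install --no-cache-dir -r requirements-omegafold.txt

COPY . .
CMD ["python", "run_benchmark.py"]
```

The Flask backend would then talk to each model container over HTTP or a shared volume instead of shelling into a different conda environment.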

Course Reflection

This is Software Project: Development II, the last course. Looking back over the full arc, I started the MS program by writing basic classification models and submitting IEEE papers on geometric deep learning prototypes. I built FastFold Suite across two capstones, first as an HCI project (8,000+ lines of React and Flask infrastructure, WCAG 2.1 AA compliance, valedictorian) and then as an ML project (three model integrations, 50-protein benchmark, statistical equivalence finding, 4,350-line backend, 155 unit tests).

The HCI coursework gave me the user-centered design thinking that shaped how FastFold Suite presents complex data. The ML coursework gave me the fundamentals of supervised learning and the statistical testing methodology. The Advanced AI course gave me the knowledge of deep learning architectures I needed to actually understand what ESMFold, OmegaFold, and AlphaFold 2 are doing under the hood. Everything connected.

Dr. Marpaung has been exactly the advisor this project needed. His emphasis on code verification, originality documentation, and correlating findings to published papers pushed me to go deeper than I would have on my own. The fact that he offered Saturday prep sessions as the defense approaches says everything about his commitment.

Am I ready for the defense? The data is there. Three benchmark runs, all confirming the same finding. An IEEE paper with every road bump. 155 tests passing. 134 verification checks matched. 19-19 in the three-way win count. ESMFold is the recommended default for single-sequence protein structure prediction: same accuracy as OmegaFold, 129 times faster, no GPU required.

I am ready.

Appendix A: AI Usage Documentation

This document was developed with the assistance of AI-powered tools for writing quality assurance. Grammarly, an AI-driven writing assistant, was used throughout the drafting process to identify and correct grammatical errors, improve sentence clarity, and ensure consistent tone. Grammarly’s suggestions were reviewed and accepted or rejected on a case-by-case basis. No content was generated solely by the tool. All research, analysis, system design, implementation, and intellectual contributions in this document are the original work of the author.
