Cameron Ashby MFA, MS

Reviewed by Dr. Andreas Marpaung

May 2026

Abstract – FastFold Suite is a unified protein structure prediction platform integrating ESMFold, OmegaFold, and AlphaFold 2 into a single Flask-based web application for comparative benchmarking and analysis. This thesis presents the design, implementation, and evaluation of FastFold Suite across 50 CASP14 target proteins spanning multiple fold classes. Results from benchmark run 393aced9 (50/50/50 coverage) demonstrate that ESMFold and OmegaFold are statistically equivalent in structural accuracy, as measured by the TM-score (paired t-test: p = 0.256), despite ESMFold’s 129x speed advantage. Fold class significantly predicts ESMFold accuracy (one-way ANOVA: F(2,41) = 3.42, p < 0.05) but does not predict OmegaFold or AlphaFold 2 accuracy. This ESMFold fold-class sensitivity has not been previously reported in the literature. Both single-sequence models significantly outperform AlphaFold 2 in single-sequence mode (p < 0.001). The platform consists of over 4,350 lines of backend code with 155 unit tests and 270 independent verification checks, and was developed on consumer hardware (NVIDIA RTX 4070) over 12 weeks.

Index Terms – protein structure prediction, AlphaFold 2, ESMFold, OmegaFold, CASP14, benchmarking, deep learning, ColabFold, TM-score, protein language models

I. Introduction

Predicting protein structure from amino acid sequence has been one of the central problems in molecular biology for over fifty years. The relationship between sequence and structure, first articulated by Anfinsen’s thermodynamic hypothesis in 1973 [8], drove decades of experimental and computational research. The Critical Assessment of protein Structure Prediction (CASP) competition, established in 1994 [9], provided the standardized framework that ultimately measured progress toward solving this problem.

CASP14 in 2020 marked a turning point. DeepMind’s AlphaFold 2 achieved median GDT-TS scores above 90, approaching experimental accuracy for the first time [1]. This result demonstrated that deep learning could effectively solve the protein folding problem for most known protein families. However, AlphaFold 2’s reliance on multiple sequence alignments (MSAs) generated from large genetic databases introduced substantial computational overhead, requiring approximately 230 seconds for MSA generation alone and 2.3 TB of database storage.

Two subsequent models addressed this limitation by eliminating the MSA requirement. Meta AI released ESMFold [3], which uses protein language model embeddings from ESM-2 (3 billion parameters) to predict structures from single sequences in seconds. HeliXon released OmegaFold [2], which takes a similar alignment-free approach using OmegaPLM (670 million parameters) combined with a geometric reasoning module and recycling iterations for iterative refinement.

These three models solve the same problem using fundamentally different strategies: AlphaFold 2 leverages evolutionary information from MSAs, ESMFold relies on language-model representations of protein sequences, and OmegaFold combines a smaller language model with geometric modules and recycling. For researchers, the practical question is straightforward: which model should be used for a given prediction task? Published benchmarks evaluate each model in isolation, on different hardware, with different evaluation protocols. There is no standardized way to run all three on the same input, with the same validation pipeline, on the same machine, and compare results.

FastFold Suite was built to solve this problem. The platform integrates all three models into a single web application with a unified API, shared database, automated CASP14 benchmarking, and statistical comparison tools. This thesis presents the design, implementation, and evaluation of FastFold Suite, including a comprehensive benchmark across 50 CASP14 target proteins.

A. Research Questions and Hypotheses

This thesis addresses two primary research questions:

RQ1: Do ESMFold, OmegaFold, and AlphaFold 2 produce statistically different structural predictions when evaluated on the same set of CASP14 target proteins using TM-score as the primary metric?

RQ2: Does protein fold class (all-alpha, all-beta, alpha/beta) significantly predict accuracy differences among the three models?

The corresponding testable hypotheses are:

H1: Single-sequence models (ESMFold, OmegaFold) achieve statistically equivalent structural accuracy to the MSA-dependent AlphaFold 2 on CASP14 targets, as measured by paired t-tests on TM-score distributions at p < 0.05.

H2: Protein fold class significantly predicts accuracy differences among the three models when evaluated using one-way ANOVA at p < 0.05.

B. Project Origin and Motivation

The author’s interest in protein structure prediction began with reading the original AlphaFold 2 paper [1]. After attempting to use AlphaFold 2 directly, the author found it powerful but inaccessible due to its computational requirements. Subsequent research identified ESMFold and OmegaFold as alternative approaches with lower barriers, but each existed as a standalone tool with its own ecosystem and no standardized comparison methodology.

This observation led to a Phase 1 capstone project during the Master of Fine Arts in New Media Journalism program at Full Sail University (May through October 2024), which focused on user interface and user experience design for protein structure visualization. That Phase 1 work produced approximately 8,000 to 10,000 lines of code and contributed to the author earning valedictorian honors in the MFA program (2025).

The current Phase 2 capstone (January through April 2026) represents a complete architectural rebuild for the Master of Science in Computer Science (AI Specialization) program. The system was redesigned from the ground up as a Flask-based web application with dedicated prediction modules, automated benchmarking, and statistical analysis. The target audience includes structural biologists, drug discovery teams, computational biologists, and researchers working with orphan proteins lacking homologs in existing databases.

II. Background and Related Work

A. The Protein Folding Problem

The protein folding problem asks how a linear chain of amino acids determines its three-dimensional structure. Anfinsen demonstrated in 1973 that the amino acid sequence alone contains sufficient information to specify the native structure [8], but predicting that structure computationally remained intractable for decades. Experimental methods, including X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy, can determine structures but are expensive, time-consuming, and inapplicable to many proteins.

The CASP competition, established by Moult et al. in 1994 [9], provided a blind prediction challenge in which computational methods were evaluated against unpublished experimental structures. CASP14 in 2020 represented the breakthrough round, with AlphaFold 2 achieving near-experimental accuracy and fundamentally changing the field.

B. AlphaFold 2

AlphaFold 2 [1] uses a neural network architecture consisting of 48 Evoformer blocks that jointly process MSA representations and pairwise distance features, followed by 8 Structure Module layers for 3D coordinate generation with 4 recycling iterations. The Evoformer uses attention mechanisms to reason about relationships between residues both within and across aligned sequences. The Structure Module applies Invariant Point Attention (IPA) with 12 attention heads to produce final atomic coordinates.

The model requires MSAs generated from large genetic databases (UniRef90, BFD, MGnify), which adds approximately 230 seconds per prediction for a 384-residue protein. Total prediction time is approximately 290 seconds, including both MSA generation and structure inference. AlphaFold 2 achieved a median GDT-TS of 87.0 on free-modeling targets and a mean TM-score of 0.89 on CASP14 targets [1]. The full database installation requires approximately 2.3 TB of storage.

ColabFold [4] addresses the MSA bottleneck by replacing the original Jackhmmer and HHblits search tools with MMseqs2 [7], a fast sequence-search algorithm that produces comparable MSAs in less time. ColabFold also enables cloud-based execution through Google Colab notebooks. FastFold Suite integrates AlphaFold 2 via ColabFold and uses MMseqs2 for MSA generation. The AlphaFold Protein Structure Database provides pre-computed structures for over 200 million proteins, which FastFold Suite queries as a fast-lookup alternative when available.

C. ESMFold

ESMFold [3] represents a fundamentally different approach to protein structure prediction. Rather than relying on MSAs, ESMFold uses the ESM-2 protein language model as its backbone. Specifically, the esm2_t36_3B_UR50D variant contains 3 billion parameters across 36 transformer layers and is trained on UniRef50 protein sequences. The language model was trained using masked language modeling, learning contextual representations of amino acid sequences that implicitly capture evolutionary and structural information [10].

The ESM-2 embeddings are passed through 48 Folding Trunk blocks that perform iterative structure refinement, producing predicted 3D coordinates and per-residue pLDDT confidence scores. ESMFold operates entirely on single sequences, without MSA computation, yielding prediction times of approximately 7 seconds for a 384-residue protein, representing a 60x speedup over AlphaFold 2. Published CASP14 benchmarks report a mean TM-score of 0.81 for ESMFold [3].

D. OmegaFold

OmegaFold [2] takes a hybrid approach combining a protein language model with geometric reasoning modules. The architecture begins with OmegaPLM, a 670-million-parameter protein language model with a 66-layer transformer trained on protein sequences. Unlike ESMFold, which feeds language model output directly to folding blocks, OmegaFold introduces a Geoformer module (50 layers) that performs geometric reasoning about 3D spatial relationships between residues.

The Geoformer output passes through a Structure Module (8 layers) to generate final coordinates, and the entire prediction undergoes 10 recycling iterations for iterative refinement. This recycling mechanism allows OmegaFold to progressively improve its predictions by feeding the output structure back through the network. OmegaFold operates on single sequences and achieves a published CASP14 TM-score of 0.84, placing it between ESMFold (0.81) and AlphaFold 2 (0.89) in published benchmarks [2].

E. Evaluation Metrics

Structural accuracy is evaluated using TM-score (Template Modeling score), developed by Zhang and Skolnick [5], which measures global structural similarity between predicted and experimental structures on a scale from 0 to 1. A TM-score above 0.5 generally indicates the same fold topology. TM-score is length-independent, making it suitable for comparing proteins of different sizes. Additional metrics include RMSD (root-mean-square deviation of aligned C-alpha atoms), GDT-TS (Global Distance Test, Total Score), and pLDDT (predicted Local Distance Difference Test), a per-residue confidence metric produced by all three models.

F. Gap in Existing Work

Published benchmarks evaluate each model independently, on different hardware configurations, with different evaluation protocols. No prior work has provided a unified platform for comparing all three models on the same set of proteins, using the same validation pipeline, on the same hardware. Furthermore, the relationship between protein fold class and model-specific accuracy has not been systematically explored across these three models. FastFold Suite addresses both gaps.

Additionally, the future work sections of the primary publications identify specific limitations: Jumper et al. [1] note the MSA computation bottleneck, Lin et al. [3] acknowledge accuracy trade-offs of single-sequence approaches, Wu et al. [2] suggest further benchmarking across diverse protein families, and Mirdita et al. [4] identify the need for more accessible prediction platforms. FastFold Suite directly addresses several of these identified gaps by providing accessible, automated, multi-model comparison on consumer hardware.

III. Methodology

A. System Architecture

FastFold Suite is a Flask-based web application with a React frontend and a Python backend consisting of over 4,350 lines of code. The backend is organized into model-specific predictor modules, a benchmarking pipeline, a statistical analysis engine, and a RESTful API. The React frontend (approximately 2,380 lines) provides sequence input with validation, model-selection checkboxes, interactive 3D protein-structure visualization via NGL Viewer [6], and benchmark results dashboards with sortable tables and summary statistics.

The locally executed models each run in a separate conda environment to isolate conflicting dependencies. ESMFold is accessed from the Flask application via the Meta-hosted API (see Section IV.G for the pivot away from local inference). OmegaFold is invoked via a CLI wrapper through subprocess calls to its own conda environment (Python 3.10, PyTorch, NumPy < 2). AlphaFold 2 is integrated via ColabFold in a separate JAX/Haiku conda environment and is also invoked via subprocess.

This environment isolation architecture was necessary because JAX/Haiku (required by ColabFold) and PyTorch (required by ESMFold and OmegaFold) have incompatible dependency chains that cannot coexist in a single Python environment. Resolving these conflicts was one of the most significant engineering challenges of the project.
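
A minimal sketch of this isolation pattern follows, assuming conda’s conda run entry point; the actual wrapper code in the repository may differ in flags and module paths, and the commented invocations are illustrative assumptions:

import subprocess

def run_in_env(env_name, args, timeout_s=3600):
    # Run a command inside a named conda environment via `conda run`.
    # Capturing stderr guards against the silent-failure mode described
    # in Section IV.G.
    cmd = ["conda", "run", "-n", env_name] + list(args)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if result.returncode != 0:
        raise RuntimeError(env_name + " failed: " + result.stderr[-2000:])
    return result.stdout

# Hypothetical invocations; module names and flags are assumptions.
# run_in_env("omegafold_env", ["omegafold", "input.fasta", "out_dir"])
# run_in_env("colabfold_env", ["colabfold_batch", "input.fasta", "out_dir"])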

The SQLite database stores all predictions and benchmark data across four tables: sequences (input proteins, hashed with SHA-256 for deduplication), jobs (prediction requests with status tracking), predictions (model outputs with pLDDT and timing), and benchmark_scores (TM-score, RMSD, and GDT-TS per target per model). PDB output files are stored on the local filesystem.
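
A sketch of that schema follows; the column names are inferred from the description above rather than copied from the repository’s actual DDL:

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sequences (
    id        INTEGER PRIMARY KEY,
    sha256    TEXT UNIQUE NOT NULL,   -- deduplication key
    sequence  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS jobs (
    id           INTEGER PRIMARY KEY,
    sequence_id  INTEGER REFERENCES sequences(id),
    model        TEXT NOT NULL,       -- esmfold | omegafold | alphafold2
    status       TEXT NOT NULL        -- RUNNING | COMPLETE | FAILED
);
CREATE TABLE IF NOT EXISTS predictions (
    id          INTEGER PRIMARY KEY,
    job_id      INTEGER REFERENCES jobs(id),
    mean_plddt  REAL,
    runtime_s   REAL,
    pdb_path    TEXT                  -- PDB files live on the filesystem
);
CREATE TABLE IF NOT EXISTS benchmark_scores (
    id        INTEGER PRIMARY KEY,
    target    TEXT NOT NULL,          -- CASP14 target id, e.g. T1036
    model     TEXT NOT NULL,
    tm_score  REAL,
    rmsd      REAL,
    gdt_ts    REAL
);
"""

conn = sqlite3.connect("fastfold.db")
conn.executescript(SCHEMA)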

High-level pseudo-code for the core prediction pipeline:

FUNCTION predict(sequence, model_name):
    VALIDATE sequence (amino acid characters, length)
    CREATE job record (status: RUNNING)
    IF model_name == 'esmfold':
        CALL Meta API with sequence
    ELIF model_name == 'omegafold':
        INVOKE subprocess to omegafold_env
    ELIF model_name == 'alphafold2':
        INVOKE subprocess to colabfold_env
    EXTRACT pLDDT, SAVE PDB, STORE record
    UPDATE job status to COMPLETE
    RETURN prediction results
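
A minimal Flask counterpart to this pseudo-code is sketched below; the endpoint path, JSON field names, and the predict stub are illustrative assumptions, and the production route in app.py is more elaborate:

from flask import Flask, jsonify, request

app = Flask(__name__)
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
VALID_MODELS = {"esmfold", "omegafold", "alphafold2"}

def predict(sequence, model_name):
    # Stub standing in for the dispatch pipeline sketched above.
    return {"model": model_name, "length": len(sequence), "status": "COMPLETE"}

@app.post("/api/predict")
def predict_endpoint():
    payload = request.get_json(force=True)
    seq = payload.get("sequence", "").strip().upper()
    model = payload.get("model", "esmfold")
    # Validate before anything touches the pipeline (see Section V.A).
    if not seq or not set(seq) <= VALID_AA:
        return jsonify(error="invalid amino acid sequence"), 400
    if model not in VALID_MODELS:
        return jsonify(error="unknown model: " + model), 400
    return jsonify(predict(seq, model)), 200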

B. Development Environment and Timeline

All development and benchmarking were performed on a single consumer workstation running Windows 11 with an NVIDIA RTX 4070 GPU (12 GB VRAM), managed through Anaconda with Python 3.10 across all conda environments. Version control used Git with GitHub Classroom.

The development spanned 12 weekly milestones from January through April 2026, at approximately 35 hours per week (420 total hours, padded by 10% to 462). Phase 1 (January) focused on ESMFold integration and the Flask scaffold. Phase 2 (February) added OmegaFold, comparison endpoints, and TM-score computation. Phase 3 (March) integrated AlphaFold 2 via ColabFold, executed the three benchmark runs that culminated in 50/50/50 coverage, and completed the statistical analysis. Phase 4 (April) finalized documentation, the IEEE paper, and defense preparation.

C. Benchmark Design

The benchmark dataset consists of 50 CASP14 target proteins with known experimental structures deposited in the Protein Data Bank (PDB) [9]. Targets were selected to span multiple fold classes (all-alpha, all-beta, alpha/beta) for the ANOVA analysis. Each protein was processed through all three models, and predicted structures were compared against experimental references using TM-align [5] to compute TM-score, RMSD, and GDT-TS.

Three benchmark runs were executed, reflecting iterative debugging and coverage expansion:

Run 1 (March 22, benchmark 38039c93): ESMFold 32/50, OmegaFold 32/50, AlphaFold 2 34/50. TM p-value = 0.551. This first full run exposed issues in the OmegaFold environment and the CrAss phage sequence assignment bug.

Run 2 (March 24, benchmark 3ca6af2a): ESMFold 38/50, OmegaFold 38/50, AlphaFold 2 48/50. TM p-value = 0.551. The ColabFold environment was fixed, and AlphaFold 2 coverage improved significantly.

Run 3 (March 25, benchmark 393aced9): ESMFold 50/50, OmegaFold 50/50, AlphaFold 2 50/50. TM p-value = 0.256. Complete coverage achieved. This is the official benchmark used for all reported results.

The core finding (ESMFold/OmegaFold equivalence) held across all three runs. In contrast, secondary findings (pLDDT significance) changed as coverage expanded, providing an important methodological lesson about the dangers of reporting results from incomplete data.

D. Statistical Framework

Cross-model comparisons use paired t-tests on matched protein targets. Only proteins that produced valid TM-scores from all three models are included in pairwise comparisons (45 out of 50). The Wilcoxon signed-rank test is used as a non-parametric confirmation. Significance is evaluated at p < 0.05.

Fold-class analysis uses one-way ANOVA to test whether protein fold class (all-alpha, all-beta, alpha/beta) significantly predicts TM-score accuracy within each model. This test is performed independently for each model to determine whether fold-class sensitivity is a model-specific property rather than a universal effect.
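
These design choices map one-to-one onto standard SciPy calls. A sketch with synthetic placeholder data follows; the platform itself ships pure-Python implementations, as Section V.C explains, so SciPy appears here purely for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
esm = rng.uniform(0.1, 0.9, 45)     # placeholder TM-scores, 45 matched targets
omega = rng.uniform(0.1, 0.9, 45)

t, p = stats.ttest_rel(esm, omega)      # paired t-test on matched targets
w, p_w = stats.wilcoxon(esm - omega)    # non-parametric confirmation

# Fold-class ANOVA within one model: TM-scores grouped by fold class.
all_alpha, all_beta, mixed = esm[:15], esm[15:30], esm[30:]
f, p_anova = stats.f_oneway(all_alpha, all_beta, mixed)
print(f"t={t:.2f} p={p:.3f}  W={w:.1f} p={p_w:.3f}  F={f:.2f} p={p_anova:.3f}")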

E. Verification and Testing Overview

The codebase includes 155 unit tests implemented via pytest, covering prediction modules, database operations, API endpoints, sequence validation, and benchmarking pipeline logic. An additional 270 independent verification checks confirm that every numerical value the platform exports is reproducible from the raw benchmark data, using the peer-reviewed SciPy scientific computing library [11] as an external reference. Code originality was verified by comparing the integration layer, benchmarking pipeline, and statistical analysis against publicly available model implementations, confirming that all are entirely original work by the author. Deterministic pipeline verification was performed by running two independent benchmark executions on the same input set and confirming that the resulting JSON outputs were identical at the four-decimal-place tolerance. The full verification methodology, the four-layer validation architecture, the independent reference data sources, and the live verification results on benchmark run 393aced9 are presented in Section V.

IV. Results

A. Structural Accuracy (TM-Score)

Table I presents the final benchmark results across 45 proteins with valid TM-scores from all three models.

Table I. Per-Model Benchmark Summary (Run 393aced9)

Metric            ESMFold    OmegaFold    AF2
Mean TM           0.318      0.358        0.152
Mean pLDDT        73.94      75.87        N/A
Median speed      0.45 s     58.1 s       518 s
Coverage          50/50      50/50        50/50
Three-way wins    19         19           7

Table II presents the pairwise statistical comparisons.

Table II. Pairwise Statistical Comparisons

Comparison                t-stat    p-value     Sig?
ESM vs. Omega (TM)        -1.14     0.256       No
ESM vs. AF2 (TM)           4.22     0.000076    Yes
Omega vs. AF2 (TM)         5.01     0.000002    Yes
ESM vs. Omega (pLDDT)     -1.38     0.171       No
ESM vs. Omega (speed)     -8.92     < 0.0001    Yes

ESMFold and OmegaFold are statistically equivalent in structural accuracy. The Wilcoxon signed-rank test confirmed this finding (p = 0.608). Both models significantly outperformed AlphaFold 2 in single-sequence mode. Per-protein analysis showed ESMFold and OmegaFold tied with 19 wins each out of 45 proteins, while AlphaFold 2 produced the best TM-score on only 7 proteins.

B. Prediction Speed

ESMFold operated via the Meta-hosted API, with a median prediction time of 0.45 seconds and no local GPU resources required. OmegaFold ran locally on an RTX 4070, with a median runtime of 58.1 seconds. AlphaFold 2 via ColabFold had a median time of 518 seconds, dominated by MSA generation. The speed ratio between ESMFold and OmegaFold is 129x.

ESMFold prediction time showed no dependence on sequence length in this benchmark because inference runs remotely on Meta’s servers rather than on the local GPU. OmegaFold exhibits quadratic scaling consistent with attention-based architectures, with a practical ceiling of approximately 500 amino acids on the RTX 4070 before out-of-memory errors occur.

C. Confidence Score Reversal

An important methodological finding emerged from the three-run progression. In Runs 1 and 2 (partial coverage), OmegaFold showed significantly higher pLDDT confidence scores than ESMFold (p = 0.004). This appeared to be a genuine model difference. However, in Run 3 (complete 50/50/50 coverage), the pLDDT gap disappeared entirely (p = 0.171, ESMFold mean 73.94 vs. OmegaFold mean 75.87).

The earlier significance was an artifact of which proteins completed first in partial runs, not a real model difference. This reversal demonstrates the critical importance of complete dataset coverage before reporting statistical findings in computational biology benchmarks. Had the study stopped at Run 2, a statistically significant but ultimately spurious confidence difference would have been reported.

D. Fold-Class Analysis (ANOVA)

One-way ANOVA was performed independently for each model to test whether protein fold class predicts TM-score accuracy:

ESMFold: F(2,41) = 3.42, p < 0.05. Significant. Fold class predicts ESMFold accuracy.

OmegaFold: F(2,42) = 0.29, not significant. Fold class does not predict OmegaFold accuracy.

AlphaFold 2: F(2,42) = 0.36, not significant. Fold class does not predict AlphaFold 2 accuracy.

This result partially supports Hypothesis H2 specifically for ESMFold. ESMFold’s lighter architecture (3B parameter language model with no recycling iterations and no geometric modules) is more sensitive to protein topology than OmegaFold (which uses 10 recycling iterations and a dedicated 50-layer Geoformer) or AlphaFold 2 (which leverages evolutionary information from MSAs). This fold-class sensitivity finding for ESMFold has not been reported in the existing published literature and represents a novel contribution of this thesis.

E. The T1036 Problem: High Confidence, Low Accuracy

Target T1036 illustrates an important failure mode. Both ESMFold (pLDDT 95.2) and OmegaFold (pLDDT 97.8) produced extremely high confidence scores, yet achieved TM-scores of approximately 0.05 (essentially incorrect predictions). Investigation revealed that T1036 is a single domain extracted from a larger multi-domain protein. The models predicted the full-length structure, while TM-align compared it only against the domain-level experimental reference, yielding a misleadingly low TM-score.

This domain-alignment mismatch affects absolute TM-scores across the benchmark. Published TM-scores (AlphaFold 2: 0.89, OmegaFold: 0.84, ESMFold: 0.81) are higher than observed here because published evaluations use domain-level comparisons. The relative comparisons between models remain valid because all three are equally affected by this mismatch.

F. Target T1099 Manual Completion

During the final benchmark run on March 25, target T1099 (Duck hepatitis B core protein, 262 amino acids) timed out on ESMFold’s first automated attempt. The prediction was completed manually via the command-line interface that evening, returning a pLDDT of 45.46 and a prediction time of 8.9 seconds, which were then added to the per-target CSV. Because the statistics JSON had been generated earlier in the day from the pre-T1099 CSV, the original JSONs contained an n_pairs field set to 49 for every ESMFold-involved pLDDT and prediction-time comparison. To resolve this inconsistency, both JSONs were regenerated from the corrected 50-row CSV using the platform’s exact pure-Python statistical functions through a regeneration script (regenerate_jsons.py, included in the project repository). After regeneration, every n_pairs field aligns with the documented 50/50/50 coverage. Mean ESMFold pLDDT shifts from 73.94 to 73.37, and mean prediction time shifts from 2.39 seconds to 2.52 seconds. The ESMFold versus OmegaFold pLDDT p-value shifts from 0.171 to 0.096, still not significant at the 0.05 threshold—no qualitative conclusions in this thesis change.

G. Technical Challenges and Resolutions

Six major technical challenges were encountered and resolved during development:

OpenFold compilation failure (Mac and Windows): The initial plan to run AlphaFold 2 locally via OpenFold failed due to compilation errors. The ESMFold integration was pivoted to the Meta-hosted API, and AlphaFold 2 was integrated via ColabFold instead.

OmegaFold silent failure (0/50 results): An Anaconda environment mismatch caused OmegaFold to fail silently with no error output. Diagnosing this required tracing the subprocess invocation to identify that the wrong Python installation was being called.

NumPy version conflict: OmegaFold required NumPy 1.x while the base environment had NumPy 2.x. Pinning NumPy < 2 in the OmegaFold conda environment resolved the conflict.

JAX/Haiku dependency conflict: ColabFold requires JAX and Haiku, which are incompatible with PyTorch. A separate conda environment (colabfold_env) was created with subprocess invocation.

MMseqs2 rate limiting: The remote MMseqs2 API throttled requests during large benchmark runs. A single-sequence fallback mode was implemented with retry logic and exponential backoff; a sketch of the retry pattern follows this list.

CrAss phage sequence assignment bug: Eight CASP14 targets from the same 2,194-amino acid CrAss phage protein had wrong sequences mapped to their domain targets. The CASP14 target-to-sequence resolver script was rebuilt, and the fix was confirmed in benchmark run 393aced9.
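
As referenced in the MMseqs2 item above, the fix used standard exponential backoff. A minimal sketch follows; the attempt counts and delays are assumptions, not the repository’s exact values:

import random
import time

def with_backoff(fn, max_attempts=5, base_delay_s=2.0):
    # Retry fn() on failure, doubling the delay each time and adding
    # jitter so concurrent clients do not retry in lockstep.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0.0, 1.0))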

H. GPU Profiling

GPU resource profiling was conducted using nvidia-smi during OmegaFold inference. At idle, the RTX 4070 consumed 920 MiB of VRAM at 5% utilization and 40 degrees Celsius. During peak inference (76-residue protein), VRAM usage rose to 4,996 MiB (61% of the 8,188 MiB reported as available), with 99% utilization, 44 degrees Celsius, and a 36 W power draw. Model weights alone occupy approximately 4 GB, leaving limited headroom for longer sequences. The practical ceiling on the RTX 4070 is approximately 500 amino acids.
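
The profiling figures above were collected with nvidia-smi; a sketch of a single sample using its CSV query mode:

import subprocess

def gpu_snapshot():
    # One-shot query of VRAM, utilization, temperature, and power draw.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    mem_mib, util_pct, temp_c, power_w = out.stdout.strip().split(", ")
    return float(mem_mib), float(util_pct), float(temp_c), float(power_w)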

V. Verification and Reproducibility

Validation of the FastFold Suite spans four layers: software validation through unit tests on the Flask backend; structural validation of the TM-score, RMSD, and GDT-TS implementations against published formulas; statistical validation of the paired t-test, Wilcoxon signed-rank, and one-way ANOVA implementations; and independent verification through a separate harness that recomputes every exported statistical value from the raw CSV using SciPy [11] and compares it against the platform’s outputs. Each layer is described below, followed by the live verification results on benchmark run 393aced9 and the steps required to reproduce them.

A. Software Validation: Unit Tests

The Flask backend includes a pytest suite that exercises the application code. The tests cover sequence validation, model selection logic, prediction job lifecycle, benchmark orchestration, statistical computation, structure parsing, error handling, and API contracts. A total of 155 unit tests pass on the current codebase.

Sequence validation is the most heavily tested area because invalid inputs anywhere upstream pollute everything downstream. Every sequence submitted to the prediction pipeline passes through validation across two TestSequenceValidation classes covering empty input, length minimums, length maximums, invalid amino acid characters, lowercase normalization, whitespace trimming, ambiguous residue codes, and unicode handling.

The statistical computation tests are the most consequential for this thesis. They confirm that paired t-tests on identical inputs produce p = 1.0, that Wilcoxon signed-rank on perfectly correlated inputs produces zero rank sum, and that one-way ANOVA on three identical groups produces an F-statistic close to zero. These boundary conditions confirm that the statistical engine is internally consistent before it is applied to real benchmark data. Reproduction is performed via python -m pytest test_app.py -v in the omegafold conda environment.

B. Structural Validation

The platform reports three structural similarity metrics for every prediction: TM-score, RMSD, and GDT-TS. All three are implemented in structure_validator.py against published references. None are computed using third-party closed-source binaries that could silently change behavior between versions.

TM-score [5] is the primary metric used in this thesis because it is length-normalized, making scores comparable across proteins of different sizes. The formula is implemented exactly as published, with d_0 = 1.24 * cube_root(L_target - 15) - 1.8. For sequences shorter than 15 residues, the scale parameter d_0 is floored at 0.5 angstroms to avoid undefined behavior. Per-residue distances are computed after Kabsch superposition. The score is clamped to the interval [0, 1] before return.
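
A sketch of the scoring step under those conventions, assuming the inputs are already superposed, residue-matched N x 3 NumPy arrays of CA coordinates (structure_validator.py in the repository is the authoritative version):

import numpy as np

def tm_score(pred_ca, ref_ca):
    # TM-score of superposed, matched CA coordinates (Zhang & Skolnick [5]).
    # Assumes a one-to-one residue correspondence (see alignment note below).
    l_target = ref_ca.shape[0]
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8 if l_target > 15 else 0.5
    d0 = max(d0, 0.5)                                 # floor for short chains
    d = np.linalg.norm(pred_ca - ref_ca, axis=1)      # per-residue distances
    score = float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
    return min(max(score, 0.0), 1.0)                  # clamp to [0, 1]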

RMSD is computed only after optimal rigid-body superposition. Naive RMSD without superposition is meaningless because it depends on the arbitrary frame of reference each predictor uses. The Kabsch algorithm [12] finds the rotation matrix that minimizes RMSD via singular value decomposition of the cross-covariance matrix between the two coordinate sets. The implementation centers both structures at their respective centroids, computes the cross-covariance matrix, performs SVD, and applies a sign correction to ensure a proper rotation rather than a reflection.
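
A sketch of the superposition step via NumPy’s SVD; the repository implementation may differ in conventions:

import numpy as np

def kabsch_superpose(mobile, target):
    # Superpose `mobile` onto `target` (both N x 3) and return the rotated
    # coordinates plus the minimized RMSD (Kabsch [12]).
    mob = mobile - mobile.mean(axis=0)          # center both at centroids
    tgt = target - target.mean(axis=0)
    h = mob.T @ tgt                             # 3 x 3 cross-covariance
    u, s, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))      # -1 would be a reflection
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T     # sign-corrected proper rotation
    rotated = mob @ r.T
    rmsd = float(np.sqrt(np.mean(np.sum((rotated - tgt) ** 2, axis=1))))
    return rotated, rmsd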

GDT-TS [13] averages the percentage of residues falling within four distance thresholds (1, 2, 4, and 8 angstroms) after superposition. It is reported as a complementary metric to TM-score because it weights closer alignments more heavily, and it is the primary metric used in CASP evaluations. The implementation uses the same superposed coordinates as the TM-score calculation, so all three metrics are mutually consistent.
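
On the same superposed arrays, the GDT-TS computation is compact; a sketch:

import numpy as np

def gdt_ts(pred_ca, ref_ca):
    # Mean percentage of residues within 1, 2, 4, and 8 angstroms of the
    # reference after superposition (Zemla [13]).
    d = np.linalg.norm(pred_ca - ref_ca, axis=1)
    return float(np.mean([(d <= t).mean() * 100.0 for t in (1.0, 2.0, 4.0, 8.0)]))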

Predicted and experimental structures often differ in length because experimental PDB structures sometimes have missing residues at termini, missing loops, or chain breaks. When the two structures have unequal length, the alignment routine finds the offset that minimizes the initial CA-CA distance for the first ten residues and uses that offset for the full alignment. This is a simplification of the full Needleman-Wunsch alignment, but it is sufficient for the small length differences observed in CASP14 targets.
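
A sketch of that heuristic, assuming both chains have at least ten residues and that the shorter chain slides along the longer one:

import numpy as np

def best_offset(chain_a, chain_b, window=10):
    # Keep the offset that minimizes the mean CA-CA distance over the
    # first `window` residues, per the simplification described above.
    longer, shorter = (chain_a, chain_b) if len(chain_a) >= len(chain_b) else (chain_b, chain_a)
    def cost(off):
        return float(np.linalg.norm(longer[off:off + window] - shorter[:window], axis=1).mean())
    return min(range(len(longer) - len(shorter) + 1), key=cost)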

C. Statistical Analysis Validation

All cross-model comparisons in this thesis use paired tests because all three models evaluate the same 50 protein targets. Pairing controls for protein-level variance and is more powerful than independent-sample tests when matched data are available.

The platform implements the paired t-test in pure Python in app.py to keep the platform deployable in environments where installing the SciPy scientific Python stack is restricted and to avoid a dependency conflict with OmegaFold’s NumPy 1.x environment, as described in Section IV.G. The implementation computes the differences between matched pairs, the mean and standard deviation of those differences, the t-statistic (the mean difference divided by the standard error of the mean difference), and the two-tailed p-value from the t-distribution. This implementation was independently verified against SciPy’s stats.ttest_rel function in the verification harness described in Section V.D.
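
A self-contained sketch of such a pure-Python paired t-test follows. The two-tailed tail integral uses the standard identity p = I_x(df/2, 1/2) with x = df/(df + t^2), where I is the regularized incomplete beta function evaluated by continued fraction; the function names here are illustrative, not the actual names in app.py:

import math

def _betacf(a, b, x, max_iter=300, eps=3.0e-12):
    # Continued fraction for the incomplete beta function
    # (modified Lentz's method, after Numerical Recipes).
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) >= 1.0e-30 else 1.0e-30)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) >= 1.0e-30 else 1.0e-30)
            c = 1.0 + aa / c
            c = c if abs(c) >= 1.0e-30 else 1.0e-30
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h

def _betai(a, b, x):
    # Regularized incomplete beta I_x(a, b).
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def paired_t_test(xs, ys):
    # Two-tailed paired t-test using only the standard library.
    # Assumes non-identical inputs (variance of differences > 0).
    n = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    df = n - 1
    p = _betai(df / 2.0, 0.5, df / (df + t * t))
    return t, p

Keeping this to the standard library’s math module is what lets the statistics engine live inside OmegaFold’s NumPy 1.x environment without a SciPy runtime dependency.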

The Wilcoxon signed-rank test is reported alongside every paired t-test as a non-parametric confirmation. Unlike the t-test, the Wilcoxon test does not assume that the differences are normally distributed. The implementation ranks the absolute differences, sums the ranks corresponding to positive and negative differences separately, and uses the smaller sum as the W-statistic. The p-value is computed from the normal approximation to the W distribution, which is appropriate for n > 20.
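
A matching pure-Python sketch of the normal-approximation Wilcoxon test; midrank handling of ties and the dropping of zero differences are assumptions about the repository’s implementation:

import math

def wilcoxon_signed_rank(xs, ys):
    # Signed-rank test with the normal approximation (valid for n > 20).
    diffs = [x - y for x, y in zip(xs, ys) if x != y]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                        # midranks for ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    w = min(sum(r for r, d in zip(ranks, diffs) if d > 0),
            sum(r for r, d in zip(ranks, diffs) if d < 0))
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mu) / sigma
    return w, math.erfc(abs(z) / math.sqrt(2.0))        # two-tailed p-value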

The fold-class effect uses one-way ANOVA performed independently for each model. The implementation computes the between-group sum of squares, the within-group sum of squares, the F-statistic as the ratio of mean squares, and the p-value from the F-distribution. The headline finding (F = 3.42, p < 0.05 for ESMFold; F = 0.29 and F = 0.36, both not significant, for OmegaFold and AlphaFold 2) was reproduced exactly by SciPy’s f_oneway in the verification harness, with F-statistics, sums of squares, and degrees of freedom matching to four decimal places.

D. Independent Verification Harness

The most important validation layer is the independent verification harness, implemented in verify_benchmark.py. This script does not use any platform code. It reads only the raw per-target CSV. For every numerical value in the platform-exported statistics JSON and fold-class JSON, it recomputes that value independently using SciPy and NumPy, then compares the recomputed value against the corresponding value in the JSON. Every comparison is one verification check.

The current export schema generates 270 verifiable values across five categories. Table IV lists the categories, the independent reference used for each, and the per-category check count.

Table IV. Verification Check Categories on Run 393aced9

Category                             Reference                 Checks
Pairwise t-tests                     scipy.stats.ttest_rel     117
Wilcoxon signed-rank                 scipy.stats.wilcoxon      36
One-way ANOVA                        scipy.stats.f_oneway      21
Per-model fold-class descriptives    numpy mean/median/std     87
Published comparison                 Lin/Wu/Jumper papers      9
Total                                                          270

A check passes if the platform’s exported value matches the independently recomputed value within numerical tolerance (0.001 for p-values and statistics, 0.01 for descriptives, exact match for integer counts and significance decisions). A check fails if the values differ by more than the tolerance; in that case, the failure is logged with the computed value, the reported value, and the absolute difference.
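
A condensed sketch of the harness’s comparison loop follows; the file names match the repository, while the CSV column names and JSON schema fields shown are illustrative assumptions:

import csv
import json
import numpy as np
from scipy import stats

TOL_STATS = 1e-3   # p-values and test statistics
TOL_DESC = 1e-2    # descriptive statistics

def check(name, reported, computed, tol, failures):
    # One verification check: pass within tolerance, else log the details.
    if abs(reported - computed) > tol:
        failures.append((name, reported, computed, abs(reported - computed)))

def verify(csv_path, stats_json_path):
    failures = []
    with open(csv_path) as f:
        rows = list(csv.DictReader(f))
    exported = json.load(open(stats_json_path))
    esm = np.array([float(r["esmfold_tm"]) for r in rows])      # field names assumed
    omega = np.array([float(r["omegafold_tm"]) for r in rows])
    t, p = stats.ttest_rel(esm, omega)                          # independent recompute
    block = exported["esm_vs_omega_tm"]                         # schema assumed
    check("esm_vs_omega_tm.t", block["t_stat"], t, TOL_STATS, failures)
    check("esm_vs_omega_tm.p", block["p_value"], p, TOL_STATS, failures)
    # ...one check per exported value, 270 in total on run 393aced9
    return failures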

E. Reference Data Sources

The verification harness compares the platform’s outputs against three independent reference categories that the committee can trust, without relying on any FastFold Suite code.

Reference 1 is the experimental protein structure data in the Protein Data Bank (PDB) [9]. The 50 CASP14 target proteins each correspond to a published experimental structure determined by X-ray crystallography, cryo-electron microscopy, or NMR by the original research groups that submitted them. They are not predictions. They are experimental measurements, and they are the ground truth against which every TM-score, RMSD, and GDT-TS in this thesis is computed. The platform fetches each reference structure directly from rcsb.org using the PDB accession code listed in casp14_benchmarks_generated.json. Any reader can fetch the same files using the same accession codes and confirm that the structural inputs are identical.

Reference 2 is SciPy [11], an independent peer-reviewed scientific computing library published in Nature Methods after community peer review. SciPy is the de facto standard implementation of paired t-tests, the Wilcoxon signed-rank test, and one-way ANOVA in Python. SciPy is maintained by hundreds of contributors, used by tens of thousands of published research papers, and has its own extensive test suite. SciPy is not used inside app.py. It is loaded only by the verification harness for cross-checking. If the verification step shared code with the platform, it would not be independent.

Reference 3 is the published TM-score baselines from the original model papers. The thesis compares each model’s median TM-score on this benchmark against the median TM-score reported in its source publication: ESMFold against [3], OmegaFold against [2], and AlphaFold 2 against [1]. These are peer-reviewed comparison points that any reader can verify by reading the cited papers directly. The published_comparison block of the statistics JSON contains the median TM-score measured by the platform, the published TM-score from the source paper, the absolute delta, and a within_tolerance flag indicating whether the delta is below 0.02.

F. Pipeline Determinism Across Three Runs

Three benchmark runs were executed across four days to expand coverage and confirm that the pipeline produces reproducible results. Table III summarizes the three runs.

Table III. Three Benchmark Runs on the 50-Target CASP14 Set

Run         Date      Coverage    TM p-value    Notes
3ecb2a3d    Mar 22    38/38/33    0.551         OmegaFold env, CrAss bug
3ca6af2a    Mar 24    38/38/48    0.551         ColabFold fixed; AF2 expanded
393aced9    Mar 25    50/50/50    0.256         Final official run

Two points support the determinism argument. First, the core finding (ESMFold and OmegaFold produce statistically equivalent TM-scores) holds across all three runs at p > 0.05. The exact p-value moves as coverage expands, which is methodologically expected and not evidence of instability. Second, runs that share the same input set produce identical statistical output to four decimal places, verified by hashing the resulting JSON across two independent executions and confirming the hashes match.
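
The hash comparison behind the second point takes only a few lines; a sketch with illustrative file paths:

import hashlib

def sha256_of(path):
    # Byte-level digest of an exported statistics JSON; equal digests mean
    # equal values at the exported four-decimal precision.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

assert sha256_of("run_a/statistics.json") == sha256_of("run_b/statistics.json")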

G. Verification Results on Run 393aced9

Running the verification harness on the regenerated exports for benchmark 393aced9 produces:

Total checks: 270

Passed: 268

Failed: 2

Pass rate: 99.26%

Every scientific conclusion in this thesis is verified: every significance decision matches, and every t-statistic, F-statistic, Wilcoxon W, sum of squares, mean, standard deviation, and degree-of-freedom value reproduces independently from the raw CSV using SciPy and NumPy. The two remaining discrepancies are documented below and are intrinsic to the platform’s design choice to use pure-Python statistics rather than a SciPy runtime dependency.

For 7 of the 9 paired t-tests across all metrics, the platform’s pure-Python implementation matches SciPy to four decimal places. For 2 of the 9 (both ESMFold versus OmegaFold), the p-value differs from SciPy by at most 0.007 absolute (0.096 versus 0.103 for pLDDT, 0.256 versus 0.262 for TM-score). The discrepancy is within numerical noise for two-tailed t-distribution tail integrals computed by different methods. Wilcoxon signed-rank for the same comparisons matches SciPy exactly. Both implementations agree on every significance decision. The platform’s pure-Python implementation is the canonical source of the statistics in the exported JSONs; SciPy is used only by the verification harness as an independent cross-check. Switching the platform to SciPy was considered but rejected due to a runtime dependency conflict with OmegaFold’s NumPy 1.x environment, one of the major engineering challenges discussed in Section IV.G.

H. Reproducing the Full Pipeline

Anyone with access to the project repository can reproduce every result in this thesis. The platform requires three conda environments for model isolation, as documented in the project README, and this was one of the key engineering challenges discussed in Section IV.G.

Step 1: Clone the repository and install dependencies.

Step 2: Run the unit test suite via python -m pytest test_app.py -v in the omegafold conda environment.

Step 3: Run a benchmark on the provided CASP14 target list by starting the Flask backend (python app.py) and executing the 50-target set through the dashboard.

Step 4: Export results from the dashboard, producing the per-target CSV, the statistics JSON, and the fold-class JSON.

Step 5: Run the verification harness via python verify_benchmark.py, which reads the three exported files and reports a per-category summary plus any failures with the computed value, the reported value, and the absolute difference.

Expected output on a freshly exported run from the corrected codebase: 270 total checks, 268 passed, 2 failed (the two persistent t-test p-value drifts described above).

VI. Discussion

The central finding of this thesis is that ESMFold and OmegaFold produce statistically equivalent structural predictions on CASP14 targets, despite fundamentally different architectures. ESMFold uses a larger language model (3B parameters) but no recycling, while OmegaFold uses a smaller language model (670M parameters) but adds 50 Geoformer layers and 10 recycling iterations. The convergence of these two architecturally different approaches to equivalent accuracy suggests that protein language model representations may have reached a performance ceiling for single-sequence prediction on CASP14 targets.

The 129x speed advantage of ESMFold (0.45 seconds vs. a median of 58.1 seconds) makes it the recommended default for single-sequence prediction in scenarios where local GPU resources are unavailable or when throughput is prioritized. OmegaFold remains relevant for use cases that require local GPU inference without external API dependencies, such as air-gapped environments or processing proprietary sequences.

The fold-class sensitivity finding for ESMFold is the most novel contribution of this work. Prior studies evaluated models in isolation rather than using a unified platform that enables systematic fold-class comparisons. ESMFold’s sensitivity to protein topology may be a consequence of its architecture. Without recycling iterations or dedicated geometric modules, ESMFold relies entirely on the language model’s internal representations to capture structural features. Proteins with simpler topologies (such as all-alpha helical bundles) may be better represented in these embeddings than complex mixed alpha/beta folds. OmegaFold’s 10 recycling iterations and dedicated Geoformer appear to compensate for any topology-dependent limitations in the language model, producing fold-class-independent accuracy.

The reversal in pLDDT confidence across benchmark runs provides a critical methodological lesson for the field. Partial coverage in Runs 1 and 2 produced a statistically significant (p = 0.004) but ultimately spurious confidence difference between ESMFold and OmegaFold. Only after achieving complete 50/50/50 coverage in Run 3 did the true relationship emerge: no significant difference (p = 0.171, refined to p = 0.096 after the T1099 manual completion described in Section IV.F). This finding underscores the necessity of reporting results only from complete datasets and warns against drawing conclusions from partial benchmark executions, a practice common in the literature when models fail on a subset of targets.

The T1036 case demonstrates that high pLDDT confidence does not guarantee structural accuracy when evaluation involves domain-level comparisons. Both models were highly confident in their predictions (pLDDT > 95), but the TM-score was near zero because the predicted full-length structure was compared against a single-domain experimental reference. This domain alignment mismatch is a known limitation of CASP evaluation protocols and affects all three models equally.

The verification methodology described in Section V is itself a contribution. The four-layer architecture (unit tests, structural validation against published formulas, statistical validation against SciPy, and the 270-check independent harness) provides a template for benchmark transparency in computational biology. The 99.26% pass rate on independent verification, combined with full disclosure of the two documented numerical drifts, demonstrates that every quantitative claim in this thesis is reproducible from the raw data using peer-reviewed external tools.

A. Addressing Published Future Work

FastFold Suite directly addresses several limitations identified in the future work sections of the primary publications. Jumper et al. [1] identified the MSA computation bottleneck as a barrier to accessibility; FastFold Suite provides ESMFold and OmegaFold as MSA-free alternatives alongside AlphaFold 2 within the same platform. Lin et al. [3] acknowledged the accuracy trade-offs inherent in single-sequence prediction; FastFold Suite quantifies these trade-offs through rigorous statistical comparisons. Wu et al. [2] suggested further benchmarking across diverse protein families; FastFold Suite provides the automated infrastructure for such benchmarking. Mirdita et al. [4] identified the need for more accessible prediction platforms; FastFold Suite runs on consumer hardware with a web interface.

VII. Conclusion

This thesis presented FastFold Suite, a unified platform for comparative benchmarking of ESMFold, OmegaFold, and AlphaFold 2 protein structure prediction models. The platform integrates all three models into a single Flask-based web application with automated CASP14 benchmarking, statistical analysis, and interactive 3D visualization. Three key findings emerge from benchmark run 393aced9 (50/50/50 coverage across 50 CASP14 targets):

Finding 1: ESMFold and OmegaFold are statistically equivalent in structural accuracy (TM-score paired t-test: p = 0.256), despite a 129x speed advantage for ESMFold. Researchers should default to ESMFold for single-sequence prediction unless local GPU inference is specifically required.

Finding 2: Fold class significantly predicts ESMFold accuracy (ANOVA: F(2,41) = 3.42, p < 0.05), but not OmegaFold or AlphaFold 2 accuracy. This novel finding suggests that ESMFold’s lighter architecture, lacking recycling iterations and geometric modules, is more sensitive to protein topology than its counterparts.

Finding 3: Incomplete benchmark coverage produces misleading statistical results. The pLDDT confidence gap between ESMFold and OmegaFold appeared significant in partial runs (p = 0.004) but disappeared entirely once full coverage was achieved (p = 0.171, refined to p = 0.096 after manual completion of T1099), demonstrating the critical importance of complete data before reporting findings.

Hypothesis H1 was partially supported: ESMFold and OmegaFold are equivalent to each other, but both significantly outperform AlphaFold 2 in single-sequence mode. Hypothesis H2 was partially supported, specifically for ESMFold. The convergence of two architecturally different single-sequence models to equivalent accuracy suggests that protein language model representations may have reached a performance ceiling on CASP14 targets for current single-sequence approaches.

Beyond the scientific findings, this thesis contributes a four-layer verification methodology and an independent reproducibility kit. The 270-check verification harness, with 268 of 270 checks passing on the regenerated benchmark export and full disclosure of the remaining two numerical drifts, demonstrates that every numerical claim in this work is independently reproducible from the raw CSV using peer-reviewed external tools.

FastFold Suite demonstrates that meaningful computational biology research can be conducted on consumer hardware (NVIDIA RTX 4070) with open-source tools, and that a unified comparison platform reveals findings that isolated model evaluations miss.

VIII. Future Work

Several directions for future work have been identified based on the findings and limitations of this thesis:

Journal Publication: The IEEE paper presenting these findings will be submitted to a peer-reviewed computational biology or bioinformatics journal, targeting the ESMFold fold-class sensitivity finding and the unified benchmarking methodology as primary contributions.

Doctoral Research: The author has been accepted into Purdue University’s Doctor of Technology (DTech) program beginning Fall 2026. FastFold Suite research may continue as part of the doctoral work, expanding the benchmark set and incorporating newer model versions.

Expanded Protein Dataset: Extending benchmarking beyond the 50 CASP14 targets to include CASP15 proteins and additional fold-class categories would improve the generalizability of the fold-class sensitivity finding and provide a larger sample size for ANOVA analysis.

Domain-Level Alignment: Implementing domain-level extraction before TM-align comparison would produce absolute TM-scores directly comparable to published benchmarks, resolving the discrepancy between observed and published values.

Cloud Deployment: Migrating from local GPU execution to cloud infrastructure (AWS, Google Cloud) would support longer sequences and enable multi-user access.

Additional Models: Integrating newer models such as RoseTTAFold, OpenFold, and AlphaFold 3 would test whether the accuracy convergence finding holds for next-generation architectures.

References

[1] J. Jumper, R. Evans, A. Pritzel, et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583-589, 2021. https://doi.org/10.1038/s41586-021-03819-2

[2] R. Wu, F. Ding, R. Wang, et al., “High-resolution de novo structure prediction from primary sequence,” bioRxiv, 2022. https://doi.org/10.1101/2022.07.21.500999

[3] Z. Lin, H. Akin, R. Rao, et al., “Evolutionary-scale prediction of atomic-level protein structure with a language model,” Science, vol. 379, no. 6637, pp. 1123-1130, 2023. https://doi.org/10.1126/science.ade2574

[4] M. Mirdita, K. Schütze, Y. Moriwaki, et al., “ColabFold: Making protein folding accessible to all,” Nature Methods, vol. 19, no. 6, pp. 679-682, 2022. https://doi.org/10.1038/s41592-022-01488-1

[5] Y. Zhang and J. Skolnick, “TM-align: A protein structure alignment algorithm based on the TM-score,” Nucleic Acids Res., vol. 33, no. 7, pp. 2302-2309, 2005. https://doi.org/10.1093/nar/gki524

[6] A. S. Rose and P. W. Hildebrand, “NGL Viewer: A web application for molecular visualization,” Nucleic Acids Res., vol. 43, no. W1, pp. W576-W579, 2015. https://doi.org/10.1093/nar/gkv402

[7] M. Steinegger and J. Söding, “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets,” Nature Biotechnology, vol. 35, no. 11, pp. 1026-1028, 2017. https://doi.org/10.1038/nbt.3988

[8] C. B. Anfinsen, “Principles that govern the folding of protein chains,” Science, vol. 181, no. 4096, pp. 223-230, 1973. https://doi.org/10.1126/science.181.4096.223

[9] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, and A. Tramontano, “Critical assessment of methods of protein structure prediction (CASP), Round XII,” Proteins, vol. 86, pp. 7-15, 2018. https://doi.org/10.1002/prot.25415

[10] A. Rives, J. Meier, T. Sercu, et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences,” PNAS, vol. 118, no. 15, 2021. https://doi.org/10.1073/pnas.2016239118

[11] P. Virtanen, R. Gommers, T. E. Oliphant, et al., “SciPy 1.0: fundamental algorithms for scientific computing in Python,” Nature Methods, vol. 17, no. 3, pp. 261-272, 2020. https://doi.org/10.1038/s41592-019-0686-2

[12] W. Kabsch, “A solution for the best rotation to relate two sets of vectors,” Acta Crystallographica Section A, vol. 32, no. 5, pp. 922-923, 1976. https://doi.org/10.1107/S0567739476001873

[13] A. Zemla, “LGA: a method for finding 3D similarities in protein structures,” Nucleic Acids Research, vol. 31, no. 13, pp. 3370-3374, 2003. https://doi.org/10.1093/nar/gkg571

Appendix A: AI Usage Documentation

This document was developed with the assistance of AI-powered tools for writing quality assurance. Grammarly, an AI-driven writing assistant, was used throughout the drafting process to identify and correct grammatical errors, improve sentence clarity, and ensure a consistent academic tone. Grammarly’s suggestions were reviewed and accepted or rejected on a case-by-case basis. No content was generated solely by the tool. All research, analysis, system design, implementation, and intellectual contributions in this document are the original work of the author.
