Helmkit: fast and robust conversion of HELM notation to atomistic representations for large-scale macromolecular informatics

May 29, 2026

Ramon Adàlia, Gemma Sanjuan, Tomàs Margalef, Ismael Zamora.

Abstract

The Hierarchical Editing Language for Macromolecules (HELM) provides a powerful framework for representing complex biomolecules, including peptides, oligonucleotides, and hybrid constructs. Existing tools for converting HELM notation to atomistic models, however, are limited in speed, scope, and robustness. We introduce helmkit, an open-source Python library for direct, high-throughput conversion of HELM strings to RDKit molecular objects. Designed for general macromolecular structures, helmkit supports peptides, nucleic acids, chemical linkers, and hybrids, and it natively handles inline monomers, special characters in monomer names, and automatic inference of missing attachment points. Its streamlined architecture, with minimal dependencies and built-in parallelization, reaches processing speeds of up to 5,000 HELM entities per second. Validation on large-scale datasets from PubChem (878,442 entries) and CycPeptMPDB (7,298 entries) demonstrates near-perfect accuracy, and helmkit successfully parses structures that fail in other libraries. By enabling efficient, scalable analysis of diverse macromolecules, helmkit advances computational workflows in drug discovery, virtual screening, and biomolecular engineering.
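
To illustrate why inline monomers and special characters in names complicate HELM parsing, the following minimal sketch splits the first simple polymer of a HELM string into its monomer tokens. It is not helmkit's implementation or API: the function names `split_monomers` and `parse_first_polymer` are hypothetical, the example strings are illustrative only, and the sketch assumes that monomers are separated by `.` and that anything inside square brackets (multi-character names or inline SMILES) must be kept intact. It uses only the Python standard library and does not build any atomistic structure.

```python
def split_monomers(body: str) -> list[str]:
    """Split a polymer body on '.' only when outside square brackets.

    Hypothetical helper: bracketed tokens (named or inline SMILES monomers)
    may themselves contain '.', '[' and ']', so bracket depth is tracked.
    """
    monomers, current, depth = [], [], 0
    for ch in body:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "." and depth == 0:
            # Top-level separator: close the current monomer token.
            monomers.append("".join(current))
            current = []
        else:
            current.append(ch)
    monomers.append("".join(current))
    return monomers


def parse_first_polymer(helm: str) -> tuple[str, list[str]]:
    """Return (polymer_id, monomer_tokens) for the first simple polymer.

    Assumption: the polymer body ends at the first '}' found at bracket
    depth 0; connections, groups, and annotations are ignored.
    """
    open_idx = helm.index("{")
    polymer_id = helm[:open_idx]
    depth = 0
    for i in range(open_idx + 1, len(helm)):
        ch = helm[i]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "}" and depth == 0:
            return polymer_id, split_monomers(helm[open_idx + 1:i])
    raise ValueError("unterminated polymer body")


print(parse_first_polymer("PEPTIDE1{A.G.[dF].C}$$$$"))
# ('PEPTIDE1', ['A', 'G', '[dF]', 'C'])
```

A naive `body.split(".")` would break any bracketed token that contains a period, which is one reason a depth-aware scan is used here. Resolving each token to a structure and joining monomers at their attachment points is outside the scope of this sketch.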
