Natural products are a major source of novel drugs, and with the rise of antibiotic resistance, there is an urgent need to discover new compounds. Genome mining enables the rapid identification of biosynthetic gene clusters (BGCs) responsible for natural product biosynthesis. Predicting the structures of the synthesized compounds is key to guiding their targeted discovery, but this requires detailed knowledge of the enzymatic reactions at each step of the biosynthetic pathway. While the core scaffold of many natural products can often be predicted from the initial biosynthetic steps, later tailoring modifications remain difficult to model due to the limited characterization of the enzymes involved. Reliable reaction prediction requires first gathering functional annotations and performing evolutionary analysis of these enzymes—only then can accurate computational predictions be made. In the absence of automated tools, this process becomes time-consuming. Although phylogenetic trees with functional annotations are occasionally published, reusing them directly is labor-intensive and technically challenging.
PhyloNaP addresses this gap by providing a centralized collection of annotated phylogenetic trees for enzymes involved in natural product biosynthesis.
Database update: the number of datasets has expanded from ~18,500 to ~49,000.
The database now contains more datasets with more diverse sequences. Details of the updated dataset generation pipeline are given in the pipeline description below.
Interface update:
Updates for the tree visualization page:
Updates for the tree placement page specifically:
New Feature: Personal Tree Visualization
Users can now explore and analyze their own annotated phylogenetic trees directly in the PhyloNaP interface. This new functionality allows researchers to:
If the metadata includes a MIBIG or MITE column with valid IDs, the corresponding molecular structures can also be displayed.
Access this feature through the View page to start analyzing your phylogenetic data with PhyloNaP's advanced visualization tools.
PhyloNaP's database contains comprehensive phylogenetic datasets for protein families involved in natural product biosynthesis. Each dataset includes:
High-quality, manually reviewed datasets from:
Computationally generated phylogenetic trees covering:
The dataset generation pipeline consists of a series of automated steps designed to collect, filter, and organize protein sequences into phylogenetically structured datasets.
Protein sequences were collected from four established resources:
Each sequence was enriched with metadata from multiple sources:
Sequences were clustered using MMseqs2 easy-linclust (Steinegger & Söding 2017) with a minimal length of 80 and sensitivity 7.5 (most sensitive mode), generating broad clusters with relatively high diversity. Clusters were then filtered to remove those containing only SwissProt entries (likely primary metabolism).
For each retained cluster:
Alignment: MAFFT (auto mode) (Katoh & Standley 2013).
Trimming: TrimAl (Capella-Gutiérrez et al. 2009).
Deduplication: … <column>_others metadata fields. Clusters with fewer than 10 sequences after this step are discarded.
Tree inference: FastTree (Price et al. 2010).
Subclustering: TreeCluster (Balaban et al. 2019, PLoS ONE 14(8):e0221068) using a maximal evolutionary distance threshold of 4. Trees exceeding this threshold are subclustered; each subcluster is then independently realigned, trimmed, and deduplicated, and its phylogenetic tree is inferred.
After iterative refinement, alignment quality is assessed using two criteria:
Trees are rooted using one of two approaches:
Each step includes quality control measures — sequence filtering, alignment completeness checks, and tree quality assessment — to ensure reliable phylogenetic reconstructions and meaningful functional predictions.
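The deduplication and minimum-size filter described in the pipeline can be illustrated with a minimal Python sketch. It is not the PhyloNaP pipeline code: the function names and the dictionary-based cluster representation are hypothetical, and duplicates are assumed here to mean exactly identical sequences, which the description above does not specify. Only the threshold of 10 sequences comes from the text.

```python
from __future__ import annotations

MIN_CLUSTER_SIZE = 10  # threshold stated in the pipeline description


def deduplicate(cluster: dict[str, str]) -> dict[str, str]:
    """Keep the first ID seen for each distinct sequence.

    Assumption: "duplicate" means an exactly identical sequence string.
    """
    first_id_for_seq: dict[str, str] = {}
    for seq_id, seq in cluster.items():
        first_id_for_seq.setdefault(seq, seq_id)
    return {seq_id: seq for seq, seq_id in first_id_for_seq.items()}


def filter_clusters(clusters: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Deduplicate every cluster and drop those left with fewer than 10 sequences."""
    kept: dict[str, dict[str, str]] = {}
    for name, cluster in clusters.items():
        unique = deduplicate(cluster)
        if len(unique) >= MIN_CLUSTER_SIZE:
            kept[name] = unique
    return kept
```

In this sketch a cluster of 12 sequences with only 5 distinct sequences would be discarded, while a cluster of 12 distinct sequences would be retained.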
Users can efficiently explore the database using multiple filtering and sorting options:
PhyloNaP provides powerful interactive tools for exploring phylogenetic relationships and functional annotations. Each tree page combines evolutionary context with biochemical information to facilitate enzyme function prediction.
When you click on a node and select "Get the summary of the clade", a metadata summary is displayed. This summary shows all features associated with the selected node, and the descendant branches of that node will be highlighted in color.
By default, the features are sorted from those with the most identical values to those with the greatest diversity. However, users can adjust the sorting order manually using the arrow buttons.
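One way to picture this default ordering is to score each feature by how uniform its values are within the selected clade and sort by that score. The sketch below is an illustration under assumptions, not the PhyloNaP implementation: the scoring rule (fraction of leaves sharing the most common value) and the names `homogeneity` and `sort_features` are hypothetical, and each feature is assumed to have at least one value.

```python
from collections import Counter


def homogeneity(values):
    """Fraction of leaves sharing the most common value (1.0 = all identical)."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def sort_features(clade_metadata):
    """Order feature names from most uniform to most diverse.

    clade_metadata maps a feature name to the list of values observed
    for the leaves of the selected clade.
    """
    return sorted(
        clade_metadata,
        key=lambda feature: homogeneity(clade_metadata[feature]),
        reverse=True,
    )


clade = {
    "organism": ["a", "b", "c", "d"],
    "enzyme_class": ["MT", "MT", "MT", "MT"],
    "product": ["x", "x", "y", "z"],
}
ordered = sort_features(clade)
```

Here `enzyme_class` (all leaves identical) would come first and `organism` (all leaves different) last.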
PhyloNaP enables users to classify their protein sequences by placing them onto curated phylogenetic trees using a robust, multi-step computational pipeline.
Each placement has a pendant length: the length of the branch connecting the query to the placement node or leaf.
When your sequence is placed onto a reference tree, the result is shown as one or more colored dots. Here is how to read them:
The color of each dot reflects the pendant length (evolutionary distance between your query and the placement node). A color scale is shown at the top of the tree view: brighter and more saturated dots indicate shorter evolutionary distance, while pale or faded dots indicate greater distance. The table below summarises how pendant length values relate to placement reliability:
| Pendant length | Interpretation |
|---|---|
| < 0.1 | Very short evolutionary distance — close homolog; functional inference is likely reliable. |
| 0.1 – 0.2 | Short evolutionary distance — likely related, but minor functional differences are possible. |
| 0.2 – 0.5 | Substantial evolutionary distance — use caution when assigning specific function; broad functional class may still be informative. |
| 0.5 – 1.0 | Very large evolutionary distance — avoid direct functional extrapolation; use only for broad phylogenetic context or exploratory interpretation. |
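For scripted analyses, the table above can be encoded as a small lookup. This is a sketch only: the function name is hypothetical, the treatment of boundary values (for example, assigning exactly 0.2 to the 0.2 – 0.5 bin) is an assumption, and values above 1.0 are reported as outside the table because the table does not cover them.

```python
def interpret_pendant_length(pendant_length: float) -> str:
    """Map a pendant length to the interpretation category from the table.

    Assumption: bins are half-open, so a boundary value falls in the higher bin,
    except 1.0, which is kept in the last bin.
    """
    if pendant_length < 0:
        raise ValueError("pendant length cannot be negative")
    if pendant_length < 0.1:
        return "very short evolutionary distance"
    if pendant_length < 0.2:
        return "short evolutionary distance"
    if pendant_length < 0.5:
        return "substantial evolutionary distance"
    if pendant_length <= 1.0:
        return "very large evolutionary distance"
    return "outside the range covered by the table"
```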
When multiple placements are identified, the relative size of each dot reflects its Likelihood Weight Ratio (LWR):
Clicking on any placement dot displays the exact LWR and pendant length values — no need to search for placement IDs manually.
The Likelihood Weight Ratio reflects the relative probability of each placement within the tree. If there is only one placement, the LWR will be 1.0 by definition — this does not mean the sequence is a close match to the reference clade.
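To see why a single placement always has an LWR of 1.0, it helps to look at how such ratios are commonly computed in phylogenetic placement: each candidate's likelihood is divided by the sum over all candidates. The sketch below assumes that per-placement log-likelihoods are available and uses a hypothetical function name; it illustrates the normalization and is not taken from PhyloNaP's code.

```python
import math


def likelihood_weight_ratios(log_likelihoods):
    """Normalize placement log-likelihoods so that the ratios sum to 1.

    Subtracting the maximum before exponentiating avoids numerical underflow
    for the very negative log-likelihoods typical of phylogenetic trees.
    """
    best = max(log_likelihoods)
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


single = likelihood_weight_ratios([-12345.6])        # one candidate only
tied = likelihood_weight_ratios([-100.0, -100.0])    # two equally likely candidates
```

With one candidate the ratio is 1.0 regardless of how poor the likelihood is, which is why pendant length, not LWR, indicates how close the query is to the reference sequences.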
To assess placement quality, always check the pendant length (evolutionary distance). A long pendant length means the query is distant from the reference sequences, even if LWR is high. The dot color provides this information at a glance: brighter = closer, paler = more distant.
A query may have multiple placements when the algorithm finds several equally likely positions. This often happens when the query is distant from the known sequences, or when the tree topology is ambiguous. In such cases, focus on the largest dot (highest LWR) and verify the pendant length. You can toggle between showing only the best placement and all placements using the controls in the Placements panel.
The View page allows you to visualize your own phylogenetic tree and metadata interactively — without submitting anything to the server. All processing happens entirely in your browser.
Privacy note: Your files are read by the browser and never sent to our servers. The viewer uses the same rendering engine as the main PhyloNaP database pages, so what you see is exactly how the dataset would look after submission.
Tip: Use the View page to verify your data before going to Contribute.
PhyloNaP welcomes community contributions of curated phylogenetic datasets, particularly those derived from published studies. These datasets are essential for improving coverage and annotation quality across enzyme families. Submissions can be made via the Contribute page.
We strongly encourage submissions based on published phylogenetic analyses, which may be included in their original form to preserve reproducibility. We also welcome newly constructed phylogenies for enzyme families not yet represented in PhyloNaP. All submissions are subject to basic quality checks (e.g., alignment consistency, identifier matching, and completeness of metadata) prior to inclusion.
| Item | Details | Status |
|---|---|---|
| Phylogenetic tree | Newick format (.nwk, .newick, .tree, .contree) | Required |
| Alignment | The multiple sequence alignment used to build the tree, in FASTA format (.fasta, .fa) | Required |
| Annotation table | TSV or CSV with an ID column whose values match the leaf labels in the tree and alignment. Additional columns (e.g., function, organism, product class) will appear as metadata on the tree | Required |
| Dataset name | A short, descriptive name for the dataset | Required |
| Description | A brief description of the dataset (enzyme family, scope, organism range, etc.) | Required |
| Evolutionary model | The substitution model used for tree inference (e.g., LG+F+I+R4, WAG+I+G4). For IQ-TREE users: upload the .iqtree file and the model is extracted automatically. Important for placement accuracy | Recommended |
| Alignment type | Select Full sequence or Domain. If your dataset is based on individual protein domains or heavily trimmed sequences, mark it as "Domain" — this affects placement interpretation and functional inference | Recommended |
| Author & publication | Name(s) of the tree author(s) and/or a DOI or reference to the associated publication | Optional |
| SMILES strings | Include substrate or product structures as a column in the annotation table. They will be used to render chemical structure images on the tree | Optional |
| Molecule images | Upload images of substrates, products, or reactions (PNG/JPG). Filenames should contain the matching leaf ID. For >100 images, use a .zip archive. Images will be stored in the database and displayed on the tree | Optional |
To ensure reliability and interpretability of contributed datasets, we ask authors to follow these recommendations where possible.
Alignments should be of high quality, without extensive regions of ambiguous alignment or excessive gaps.
Trimming is recommended to remove poorly aligned or highly gapped regions that may introduce noise into tree reconstruction. At the same time, care should be taken to preserve informative positions and avoid excessive trimming that reduces phylogenetic signal. As a general principle, alignments should retain sufficient length and information content relative to the underlying protein sequences.
We recommend maximum likelihood (ML) or similarly robust methods for tree inference. Neighbor-joining (NJ) trees are generally less reliable and are primarily accepted when derived from published studies.
Inclusion of branch support values (e.g., bootstrap or SH-like support) is strongly encouraged, as it helps users assess the reliability of individual clades.
Please provide the substitution model used for tree inference. For IQ-TREE, submission of the corresponding .iqtree file is recommended, as the model is extracted automatically and improves placement accuracy.
Rooting is essential for the placement algorithm to work correctly. Please ensure that your tree is rooted before submission. If the rooting method is known (e.g., outgroup, midpoint, MAD), including that information is appreciated.
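A quick heuristic check before submission is to count the children of the top-level node in the Newick string. By a widely used convention, two children indicate a rooted tree and three or more indicate an unrooted one. The sketch below is only that heuristic, with a hypothetical function name: it assumes labels contain no quoted parentheses or commas, and it cannot tell whether the root position is biologically meaningful.

```python
def root_child_count(newick: str) -> int:
    """Count the direct children of the outermost node in a Newick string.

    Assumption: labels are unquoted and contain no parentheses or commas.
    """
    depth = 0
    children = 0
    for ch in newick.strip():
        if ch == "(":
            depth += 1
            if depth == 1:
                children = 1  # the first child of the outermost node starts here
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            children += 1  # a comma at depth 1 separates top-level children
    return children


def looks_rooted(newick: str) -> bool:
    """Conventionally, a bifurcating top-level node indicates a rooted tree."""
    return root_child_count(newick) == 2
```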
If your dataset is based on protein domains or heavily trimmed sequences rather than full-length proteins, please indicate this by selecting "Domain" as the alignment type. This distinction is important because domain-based trees may affect placement interpretation and functional inference.
The most common issue with submissions is mismatched identifiers. Make sure the IDs in your annotation table match those in the tree and alignment files exactly (including case and any version suffixes such as .1). Use the View page to preview your data and catch formatting problems before submitting.
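Identifier mismatches can also be caught locally before upload. The following Python sketch compares leaf labels in a Newick tree with FASTA headers and the ID column of the annotation table. It is a rough check under stated assumptions, not part of PhyloNaP: the function names are hypothetical, leaf labels are assumed to be unquoted, the FASTA ID is taken as the first whitespace-delimited token of each header, and the table is assumed to have a column named exactly `ID`.

```python
from __future__ import annotations

import csv
import io
import re


def newick_leaves(newick: str) -> set[str]:
    """Leaf labels follow '(' or ','; internal node labels follow ')'."""
    labels = re.findall(r"[(,]\s*([^(),:;\s][^(),:;]*)", newick)
    return {label.strip() for label in labels}


def fasta_ids(fasta_text: str) -> set[str]:
    """First whitespace-delimited token of each FASTA header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def table_ids(table_text: str, delimiter: str = "\t") -> set[str]:
    """Values of the 'ID' column in a TSV (default) or CSV table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    return {row["ID"] for row in reader}


def id_mismatches(newick: str, fasta_text: str, table_text: str,
                  delimiter: str = "\t") -> dict[str, list[str]]:
    """Report IDs that do not match exactly (case and version suffix included)."""
    leaves = newick_leaves(newick)
    aligned = fasta_ids(fasta_text)
    annotated = table_ids(table_text, delimiter)
    return {
        "missing_from_alignment": sorted(leaves - aligned),
        "missing_from_table": sorted(leaves - annotated),
        "extra_in_alignment": sorted(aligned - leaves),
        "extra_in_table": sorted(annotated - leaves),
    }
```

A submission is consistent under this check when all four lists are empty; a case difference or a dropped `.1` suffix shows up as one missing and one extra entry.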
PhyloNaP: a user-friendly database of Phylogeny for Natural Product–producing enzymes
Aleksandra Korenskaia, Judit Szenei, Lisa Vader, Kai Blin, Tilmann Weber, Nadine Ziemert
bioRxiv 2025.09.23.677986 · doi: 10.1101/2025.09.23.677986
PhyloNaP can be self-hosted using the source code on GitHub:
ZiemertLab/PhyloNaP_WebApp
Includes Docker support and setup instructions.
For questions, feedback, or collaboration inquiries:
aleksandra.korenskaia@uni-tuebingen.de
Ziemert Lab, University of Tübingen, Germany
This project has received funding from the European Union's Horizon Europe programme under the Marie Skłodowska-Curie grant agreement No 101072485.
License
Released under the MIT License. Free to use for academic and commercial purposes.
Data Privacy
Cookies & Tracking
This site uses only essential session cookies. No analytics or third-party tracking is used.