Introduction

Molecular networking (MN)1 within the GNPS web platform (http://gnps.ucsd.edu)2 has been used for the analysis of nontargeted mass spectrometry data in various fields3,4. MN relies on the principle that similar structures tend to form similar patterns in fragmentation mass spectra (MS2). MN is built up through the pairwise spectral comparisons of a dataset, creating an MS2 spectral network. This network is then enriched by annotating the experimental MS2 spectra against MS2 spectral libraries2,5 or compound databases (Fig. 1). In the resulting molecular networks, annotations can be propagated through the network edges to adjacent nodes6. MN can be used to map the chemical space of complex samples to facilitate the discovery of new molecules, especially analogs of known compounds2. For the analysis of liquid chromatography-mass spectrometry (LC-MS2) data, feature-based molecular networking (FBMN) combines MN with chromatographic feature-finding tools7.

Fig. 1: The concept of ion identity molecular networking (IIMN).

The workflow integrates a MS1 feature grouping to connect different ion species of the same compound and b feature-based molecular networking to connect similar compound structures based on MS2 spectral similarity to yield c combined networks. d highlights the data processing steps to create IIMN networks in MZmine and GNPS. After feature detection and alignment across multiple samples, features are grouped based on the correlation of their chromatographic feature shapes (intensity profiles) and other MS1 characteristics. Subsequently, ion species of grouped features are identified with an ion identity library generated based on user input for included adducts, in-source modifications, and a maximum multimer parameter. After uploading these results to the GNPS web server, the IIMN workflow generates combined networks and an alternative output with all IIN collapsed into single molecular nodes to reduce complexity and redundancy.

During LC-MS ionization, a given compound can generate multiple ion species (e.g., protonated and sodiated adducts), which appear as individual nodes in a molecular network, due to different precursor mass-to-charge ratios (m/z). As various commonly detected ion adducts exhibit different fragmentation behavior during collisional activation (e.g., in collision-induced dissociation (CID) mode) (Supplementary Fig. 1), MS2 spectral networking on its own does not necessarily connect all ion adducts produced by a single compound. This often contributes to the unwanted separation of molecular families (subnetworks) and limits the propagation of library annotations through the networks. The two ion species that are most frequently represented in spectral libraries ([M + H]+ and [M + Na]+) typically stay unconnected.

Various tools have been developed for the grouping and annotation of ion species in LC-MS data. The first step, feature grouping, typically involves a retention time filter and the correlation of feature intensities across samples10,11,12. Other tools, such as CAMERA and CliqueMS, add a pairwise correlation of feature shapes to the grouping metric13,14. RAMClust provides an option to simultaneously process LC-MS data with MS2 from data-independent acquisition (DIA)10. While many tools10,12,13,14,15 directly interoperate with the feature-finding software XCMS16, MS-FLO was developed to process exported feature lists from MZmine17, MS-DIAL18, and XCMS. Generally, after feature grouping, ion species can be identified based on known mass differences. Connecting all ions that originate from the same molecule results in MS1-based groups, here referred to as ion identity networks (IIN).

In this work, we present Ion Identity Molecular Networking (IIMN) and showcase how to fuse MS2-based spectral networks with an additional networking layer based on MS1 feature shape correlation of identified ion species that originate from the same molecule. IIMN addresses this central bottleneck of unconnected ion adducts in MN and the general problem of feature redundancy in MS-based metabolomics8,9. We further show the initial validation of IIMN with a ground truth dataset with induced adduct formation by post-column infusion of salt solutions. Furthermore, we present IIMN results for two datasets of natural products standards as well as 24 publicly available experimental datasets.

Results

Workflow development

The IIMN workflow annotates and connects related ion species in feature-based molecular networks within the GNPS web platform. We integrate IIN into MS2-based molecular networks and demonstrate the application to LC-MS2 studies that make use of product ion scans acquired in data-dependent acquisition (DDA) mode. The IIMN workflow comprises feature grouping, feature shape correlation, and identification of ion species using a variety of feature-finding software tools, such as MZmine17, XCMS16, and MS-DIAL18 that make use of different algorithms for the identification of ion adducts. A table of extracted MS1 features, each with a consensus MS2 spectrum, together with IIN results are then uploaded to GNPS to run the IIMN workflow on the web server. The resulting ion identity molecular networks contain two layers of feature (node) connectivity, linking ion identities of the same compound by MS1 characteristics and structurally similar compounds by MS2 spectral similarity (Fig. 1). A detailed description of the IIMN workflow as well as a step-by-step tutorial are provided in the method section and can be found online in the GNPS documentation (https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin/). The IIMN workflow is available online (https://gnps.ucsd.edu/) and the source code is shared on GitHub under an open source license.

Validation of IIMN by post-column infusion of salt solutions

To validate the identification of ion species with IIMN, we created an LC-MS2 benchmark dataset of a natural product mixture containing 300 compounds, in which we promoted adduct formation by post-column infusion of ammonium acetate or sodium acetate at different concentrations (Fig. 2a–e). The IIMN networks can be depicted in alternative layouts that illustrate complementary results within the same dataset. GNPS also provides networks with collapsed IIN to reduce the redundancy of different ion species by merging them into a single neutral molecule (M) node (Fig. 2c). In this dataset, IIMN successfully connects ion identities and reduces the size of a complex network by 56% to four major compounds. The increased connectivity facilitates the propagation of structure annotations to neighboring in-source fragments and an unannotated compound. Finally, the abundance change of identified adducts ([M + H]+, [M + NH4]+, [M + Na]+) in our benchmark dataset is in agreement with the different post-chromatography salt infusion conditions (H2O, Na-Acetate, or NH4-Acetate, Fig. 3), which validates ion species identification on a dataset level. For instance, the abundance of [M + Na]+ and [M + NH4]+ ion identities was significantly (p < 0.001) higher in the corresponding samples with the post-column infusion of sodium acetate or ammonium acetate, respectively, when compared to the control samples. The exclusive formation of an uncommon [M + ACN + NH4]+ in-source cluster after infusion of ammonium ions into an ACN-water gradient further verifies connected ion identities.

Fig. 2: Ion identity molecular networking.

Depicted are three visualizations of the same ion identity molecular network from the post-column salt infusion experiments. a Sorting by ion identities reveals that MS2 similarity edges (blue) often link sodiated ions (e.g., [M + Na]+ and [2M + Na]+) into a subnetwork that is separated from a subnetwork of ammonium adducts with protonated species. The pie charts indicate relative abundances in different salt addition experiments (Control (H2O), gray; Na-Acetate, yellow; NH4-Acetate, green). The complexity and redundancy are reduced by b sorting all ions of the same molecule in a circular layout and c collapsing all IIN into representative single molecular nodes. This option reduces the complexity of this IIMN from 43 feature nodes to four molecular nodes (A–D) and 15 feature nodes (−56%). d Lists the structure of all GNPS library matches and e propagated structures for D (based on A and C) and the in-source fragments A’ to D’. This subset of structurally related compounds provides first statistical evidence for high rates of correct annotation during IIN in MZmine, as adduct formation responds to the corresponding salt infusion, e.g., higher [M + Na]+ abundances in the sodium acetate buffer infusion.

Fig. 3: Statistical impact of salt addition experiments on ion identity abundances.

The relative intensities of selected ion identities are plotted for each post-column infusion in triplicate. The significant change for [M + Na]+ and [M + NH4]+ ion identities in the corresponding post-column salt infusions compared to the control samples agree with the expected ionization behavior. The exclusive formation of an uncommon [M + ACN + NH4]+ in-source cluster in the ammonium acetate buffer infusion further verifies ion identity networking results. Boxplots visualize the median as a horizontal line, the mean as an x, the first (Q1) and third quartile (Q3) as the lower and upper hinges, and the whiskers corresponding to the minimum value below Q1 and the maximum value above Q3 within the 1.5 × IQR (where IQR is the interquartile range). The p-values of a Welch two-samples t-test and the corresponding number of ion identities n are provided for each pair of compared triplicate injections with different post-column salt infusion conditions. Source data are provided as a Source Data file.

Application of IIMN to 24 public datasets

To test the workflow with data generated from various sample types and on different experimental platforms, 24 public datasets were processed using the MZmine workflow and its metaCorrelate algorithm for feature grouping and ion identity networking (Fig. 4, Supplementary Table 1). All the specific parameters for processing are provided in the methods section (under Dataset processing). MZmine feature-finding parameters were optimized for each dataset by various coauthors, while the feature grouping and ion identity networking parameters were kept constant for better comparability. IIMN identified biologically relevant metal-binding compounds via post-column-induced ion species. In a native ESI-based metabolomics study, IIMN specifically revealed that the known siderophore yersiniabactin also acts as a zincophore (Supplementary Note 1, Supplementary Fig. 2)19, a finding that was validated in animal experiments.

Fig. 4: Overview of IIMN results for 24 experimental datasets.

a Summarizes the relative number of LC-MS features (with an MS2 spectrum) that were annotated by ion identities or matches to the GNPS spectral libraries. The increased annotation rate by propagating library matches to connected unannotated ion identities is highlighted and b displayed as relative gains with a mean increase by 35% compared to all library matches. c Comparison of relative ion formation tendencies measured as the number of ion identities. Boxplots summarize the statistics of overall n = 24 datasets by visualizing the median as a horizontal line, the mean as an x, the first and third quartile as the lower and upper hinges, and the whiskers corresponding to the minimum value below Q1 and the maximum value above Q3 within the 1.5 × IQR. Source data are provided as a Source Data file.

For a dataset with 88 extracts from feces and gall bladder of various animals, the comparison between feature-based molecular networking with and without the additional edges from ion identity networking demonstrates how IIMN complements and improves FBMN (Fig. 5). Here, IIMN combined multiple smaller networks and unconnected nodes into a large network of free bile acids and those conjugated to amino acids or sulfate. These results prove that IIMN is a suitable method to connect structurally similar compounds, such as isomers, based on MS2 spectral similarity scoring and feature shape correlation. FBMN only established one edge between subnetworks of free and conjugated bile acids. Overall, bile acid analogs were separated into multiple subnetworks and unconnected nodes with a clear trend of separating sodiated and protonated ion identities. Finally, the complexity and redundancy are reduced by collapsing all IINs into corresponding representative nodes. The final network has a reduced number of nodes and a higher density of edges between nodes with annotations to the same compound classes.

Fig. 5: Comparisons of a subnetwork with matches to bile acids from 88 feces and gall bladder samples of various animals (MSV000084170).

This overview compares a the FBMN results to IIMN b before and c after collapsing all ion identity networks into single representative nodes. In the top row, nodes are colorized depending on the adduct that ion identities are based on. In contrast, the lower three networks emphasize nodes with MS2 spectra that match library spectra of specific compound classes, mainly bile acids and their conjugates. The collapsed network (c) reduces the complexity and redundancy of having multiple nodes per compound and only keeps MS2 spectral similarity edges.

IIMN also yielded additional structural information in the case of samples from the mold Stachybotrys chartarum (Supplementary Note 2, Supplementary Fig. 3). In this project, IIMN revealed novel phenylspirodrimane derivatives, which were verified by nuclear magnetic resonance spectroscopy (NMR)20. In the network, the increasing number of aliphatic hydroxyl groups was reflected by the maximum number of in-source water losses, whereas acetylation of hydroxy groups reduced this number. The manual inspection of IIMN results was facilitated by additional MS1 annotations provided by ion nodes that lack MS2 fragmentation data and are consequently unavailable to the FBMN workflow. During the creation of IIMN networks, further layers of additional feature connections can be supplied. One example is a relationship between ion identity networks based on neutral mass differences that annotate putative structure modifications between compounds (Supplementary Note 3, Supplementary Fig. 4).

Across all 24 datasets, IIMN successfully reduced the number of unconnected LC-MS2 features and increased the connections to annotated compound structures (Supplementary Fig. 5, Supplementary Table 2). Annotation rates in all 24 datasets of 6% and 12% are in the expected range with contemporary MS2 library matching21,22 and MS1 ion annotation, respectively, especially with the restrictive IIN parameters chosen here (Fig. 4a). By propagating spectral library matches to first neighboring IIN nodes, the annotation rates of the test datasets were increased by an average of 35% (Fig. 4a, b). On the individual dataset level, the highest increase (325%) was observed for dataset 4, which had more MS1 data points per feature and thus better feature shape correlation at the cost of a lower MS2 acquisition rate. Most datasets (16 out of 24) experienced an increase greater than 10%, while eight datasets were below this value. After inspecting the LC-MS2 files, we found various reasons for this difference. Datasets 11 and 12, for example, had a higher focus on MS2 acquisition with a high topN of MS2 events in the DDA settings that caused lower MS1 survey scan frequencies and hence fewer data points per feature, resulting in lower IIN correlation and connectivity. For datasets 7 and 19, the MS2 annotation rate was low to begin with and hence few annotations could be propagated by IIMN.
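The propagation of library matches to first neighboring IIN nodes can be illustrated with a minimal Python sketch. It assumes that library matches are held in a dictionary keyed by feature ID and that IIN edges are given as pairs of feature IDs; the function names and data layout are hypothetical and are not taken from the GNPS code.

```python
def propagate_annotations(library_matches, iin_edges):
    """Copy a library match to unannotated features that share an IIN edge.

    library_matches: {feature_id: compound_name}
    iin_edges: iterable of (feature_id_1, feature_id_2)
    Only first neighbors receive an annotation; nothing is propagated further.
    """
    propagated = {}
    for a, b in iin_edges:
        for source, target in ((a, b), (b, a)):
            if source in library_matches and target not in library_matches:
                # keep the first propagated annotation if several are possible
                propagated.setdefault(target, library_matches[source])
    return propagated


def relative_gain(n_library_matches, n_propagated):
    """Additional annotations relative to the library matches, in percent."""
    return 100.0 * n_propagated / n_library_matches
```

In this sketch, a feature that is two IIN edges away from a library match stays unannotated, which mirrors the restriction to first neighbors described above.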

Generation of IIMN-based spectral libraries

Besides the increase in feature annotations in individual datasets, IIMN also enables the generation of propagated spectral libraries, increasing and diversifying the library coverage beyond commonly considered ion species. In positive ion mode, for example, most mass spectrometrists routinely consider [M + H]+ and [M + Na]+ adducts, but less frequently [M + NH4]+, [M + Ca]2+, [M + K]+, and in-source fragments in their data analysis and hence library contributions. However, while inspecting the relative distribution of ion identities within all 24 datasets, marine samples showed a higher percentage of [M + NH4]+ adducts (24 ± 5%) when compared to all other datasets (10 ± 8%). Sodiated adducts that were expected to be elevated in marine samples (due to anticipated higher salt contents in the original sample), in contrast, are evenly distributed between all datasets with an average of 26 ± 6% (Fig. 4c). On average, protonated species contribute 23 ± 6% of the overall ion identities in our test datasets. This indicates a spectral bias in public MS2 libraries such as MassBank of North America (66% [M + H]+) and GNPS (65% [M + H]+) (Fig. 6) and suggests that the community should provide MS2 spectra for other ion species of the same molecules to reference libraries. Here, IIMN can be used to expand the spectral libraries with additional adducts and in-source fragments in LC-MS experiments, which can significantly increase spectral library coverage and thus MS2 annotation rates. The potential to use IIMN to propagate spectral library matches to adjacent unannotated features with ion identity is evident from a mean increase of the annotation rate by 35% (Fig. 4a, b).
By propagating high-confidence spectral matches (in this case, cosine >0.9 or authentic standards) to connected ion identities from the 24 public datasets and two datasets of natural products from the NIH ‘ACONN’ collection from which an original reference library was created, we created IIMN spectral libraries with a total of 2657 entries with a broader and more representative ion species coverage (e.g., 24% [M + H]+, 22% multimeric species, 17% [M + Na]+, 15% in-source fragments, and 13% [M + NH4]+). Such spectral libraries better represent ion species observed in typical metabolomics experiments (Supplementary Table 3 and Fig. 6).

Fig. 6: Analysis of the coverage and distribution of ion identities in public LC-MS2 spectral libraries (refer to Supplementary Table 3 for library origins).

Two-thirds of the MassBank of North America LC-MS2-positive ion mode library entries were entered as [M + H]+ while only four other ion types reached more than 1000 entries, namely, [M + Na]+, [M + NH4]+, [M + K]+, and [M − H2O + H]+. Other in-source fragments, multiply charged species, and multimers are only covered for a few compounds. A significant number of entries were either annotated as negatively charged adducts (e.g., [M − H]) or were missing an annotation. As the ion identity naming was not harmonized, different versions pointing to the same ion identity were added to a total count. A similar ion annotation coverage was found in the GNPS spectral libraries. In contrast, libraries that were generated with the recently described MSMS-Chooser29 workflow on GNPS or the IIMN-based library extraction workflow, described here, show an overall broader coverage of different adducts, multimers, and in-source fragments. The depicted statistical visualization compares a subset of significant or representative ion identities. The IIMN-based numbers summarize the libraries from both the 24 experimental datasets and the two NIH natural product standards datasets with a total of 2659 library entries. Source data are provided as a Source Data file.

Discussion

In conclusion, by establishing relationships between different ion species originating from the same compound and structurally similar compounds, IIMN facilitates molecular network interpretation and compound annotation. The combined networks with two layers of feature connectivity enable strategies to present and analyze mass spectrometry data. Networks with collapsed IIN especially reduce the redundancy of detecting multiple ion species per compound. IIMN successfully connected more related compound annotations in datasets from different analytical platforms and over a variety of small molecule compound classes, hence reducing the number of unconnected nodes and increasing the annotation density in molecular networks. An exciting application of IIMN is the expansion of spectral libraries by (re)-processing public datasets and propagating spectral library annotations to create library entries of connected ion identities. The identification of ion adducts can reveal novel ionophores, some of which will be biologically relevant and are still underappreciated in the function of small molecules19,23. The integration into FBMN and the GNPS environment provided a platform to utilize IIMN in other related bioinformatics tools, e.g., SIRIUS24, CANOPUS25, and Qemistree26 for molecular formula and compound class level annotation, which will also facilitate additional validation of network connectivity. Direct interfaces to the GNPS-Dashboard and MASST27 support collaborative data visualization and repository scale MS2 queries, respectively. Furthermore, the open source code and generic connection between feature finding, ion identity molecular networking, and the online GNPS workflow encourage the implementation of interfaces to other feature grouping and ion identification algorithms. 
We anticipate that the option to add orthogonal relationships between features to IIMN will stimulate the integration and development of additional tools for spectral alignment and measures of feature–feature relationships28.

To reach a broad user base, we interfaced the IIMN workflow with three widely used open source MS processing tools (MZmine17, MS-DIAL18, and XCMS13,16). Detailed documentation and training videos are available online (https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin/). The option to directly submit IIMN analysis from MZmine to GNPS in particular provides a simple entry point for new users.

Methods

Post-column salt infusion experiments

For salt addition UHPLC-MS2 experiments, a mixture of 300 natural products from the NIH NCGC collection was prepared in 100 µL methanol-water-formic acid (80:19:1, Fisher Scientific, San Diego, USA) at a concentration of 0.01 µM of which 2 µL were injected into a Vanquish UHPLC system coupled to a Q-Exactive quadrupole orbitrap mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) in three technical replicates. For the chromatographic separation, a reversed-phase C18 porous core-shell column (Kinetex C18, 50 × 2 mm, 1.8 µm particle size, 100 Å pore size, Phenomenex, Torrance, USA) was used. For gradient elution, a Vanquish (Thermo Fisher Scientific, Bremen, Germany) high-pressure binary gradient system was used. The mobile phase consisted of solvent A H2O + 0.1% formic acid (FA) and solvent B acetonitrile (ACN) + 0.1% FA. The flow rate was set to 0.5 mL/min. Samples were eluted with a linear gradient from 0–0.5 min, 5% B, 0.5–8 min 5–50% B, 8–10 min 50–99% B, followed by a 2 min washout phase at 99% B and a 3 min re-equilibration phase at 5% B. Post-column, we infused ammonium acetate or sodium acetate solutions (50, 5, and 0 mg/L) at 10 µL/min (dilution factor 50) with a syringe pump to yield a final concentration of sodium or ammonium acetate of 1, 0.1, and 0 mg/L. Data-dependent acquisition (DDA) of MS2 spectra was performed in positive mode. Electrospray ionization (ESI) parameters were set to 52 psi sheath gas pressure, 14 AU auxiliary gas flow, 0 AU sweep gas flow, and 400 °C auxiliary gas temperature. The spray voltage was set to 3.5 kV and the inlet capillary to 320 °C. An S-lens level of 50 V was applied. MS scan range was set to m/z 150–1500 with a resolution at m/z 200 of 17,500 with one micro-scan. The maximum ion injection time was set to 100 ms with an automatic gain control (AGC) target of 1E6. Up to 5 MS2 spectra per MS1 survey scan were recorded in DDA mode with a resolution of 17,500 at m/z 200 with one micro-scan.
The maximum ion injection time for MS2 scans was set to 100 ms with an AGC target of 3.0E5 ions and a minimum 5% C-trap filling. The MS2 precursor isolation window was set to m/z 1. The normalized collision energy was set to a stepwise increase from 20 to 30 to 40% with a single charge as the default charge state. MS2 scans were triggered at the apex of chromatographic peaks within 2–15 s from their first occurrence. Dynamic precursor exclusion was set to 5 s. Ions with unassigned charge states were excluded from MS2 acquisition as well as isotope peaks.

Ion identity molecular networking-workflow overview

In general, the ion identity molecular networking (IIMN) workflow starts with LC-MS2 data processing in one of the supported open source feature-finding tools. After the creation of an aligned feature list of all samples, ion species that originate from the same analyte are grouped and annotated by MS1 criteria, such as their retention time, feature shape correlation, and m/z difference. Here, such groups are named ion identity networks. Subsequently, information of detected features and their representative MS2 spectra, ion identities, and connections to other ion identities are exported and transferred to the GNPS web server for the molecular networking part (refer to tool-specific sections for details). After the construction of ion identity molecular networks, features share connectivity based on MS2 spectral cosine similarity and MS1-based feature shape correlation. In addition to this combined network, GNPS calculates a version with collapsed IIN, where one node represents multiple ions of the same molecule. Results are available in the GNPS web interfaces and as downloads in various open formats as tables and networking files to allow local visualization, reviewing, and post-processing.

The IIMN workflow aids the feature-based molecular networking workflow7 by adding MS1-specific information, which is provided as new columns in the quantification table and as additional edges in a Supplementary Pairs text file within the GNPS-FBMN workflow. The option of additional edges from other tools was initially developed for IIMN and was introduced to stimulate and facilitate the development of new computational methods that link nodes in the resulting molecular networks. The text format follows a generic comma-separated style with the columns ID1 and ID2 (matching the feature IDs in the feature quantification table and .mgf file), EdgeType (defining the method), Score (numerical), and Annotation. To enable a broad user base to employ ion identity molecular networking in their studies, three popular mass spectrometry processing tools, namely, MZmine17, MS-DIAL30, and XCMS (+CAMERA)13,16, were modified or adapted with additional export scripts or modules. In comparison to FBMN, IIMN can include features that are lacking MS2 fragmentation spectra but are connected to other feature nodes by MS1 IIN edges. Given the higher detectability by MS1 compared to triggered MS2 acquisition, the additional nodes with ion identities complement the resulting networks with information otherwise lost in FBMN or classical MN.
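As an illustration of the Supplementary Pairs layout, the following Python sketch writes one edge in the comma-separated style with the five columns named above. Only the column names come from the text; the EdgeType value, the score, the annotation string, and the feature IDs are made-up examples, and the helper name is hypothetical.

```python
import csv
import io

# Column names as described in the text.
COLUMNS = ["ID1", "ID2", "EdgeType", "Score", "Annotation"]


def write_supplementary_pairs(edges, handle):
    """Write additional edges as a comma-separated table with a header row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for edge in edges:
        writer.writerow(edge)


# Hypothetical example edge: features 12 and 57 annotated as ions of one molecule.
example_edges = [
    {"ID1": 12, "ID2": 57, "EdgeType": "MS1 annotation", "Score": 0.98,
     "Annotation": "[M+H]+ [M+Na]+"},
]
buffer = io.StringIO()
write_supplementary_pairs(example_edges, buffer)
pairs_csv = buffer.getvalue()
```

Because the format is generic, any tool that can emit such rows with feature IDs matching the quantification table could, in principle, contribute its own edge type.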

The general steps to create ion identity molecular networks

(1) If needed, convert the spectral data files to an open format (e.g., mzML)
(2) Import the data into one of the open source tools: MZmine, MS-DIAL, or XCMS
(3) Process the data to create a feature list (aligned over all samples)
(4) Perform MS1-based feature grouping and ion identity annotation
(5) Export the feature list as a feature quantification table (.csv), an MS2 spectral summary file (.mgf), which contains a representative fragmentation spectrum for each feature, and a supplementary edges file (IIN file, .csv) (more information in the tool-specific workflow sections)
(6) Create a metadata file to group samples for statistics (optional)
(7) Upload all files to GNPS and start a new feature-based molecular networking job (MZmine can directly submit and start a new IIMN job on GNPS)
(8) Download and visualize the results in a network analysis software (e.g., Cytoscape31, https://cytoscape.org/)
(9) The option Download Cytoscape Data provides two .graphml networking files
    (a) The standard FBMN and IIMN networks (base directory)
    (b) IIMN networks with collapsed ion identity networks (in the gnps_molecular_network_iin_collapse_graphml directory)
(10) The option Direct Cytoscape Preview/Download provides the IIMN network and its collapsed version as Cytoscape projects with various style presets

Refer to the documentation on how to run FBMN within GNPS and multiple mass spectrometry data processing tools.

https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/

For IIMN, refer to the related part of the GNPS documentation.

https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin/

Generation of collapsed ion identity networks

One result of the GNPS-IIMN workflow is the combined networks with IIN collapsed into single nodes. For this, all ion nodes with the same IIN ID are merged into a representative node based on the feature with the highest library match score, if available, or otherwise the feature with the maximum abundance. While all IIN edges are collapsed, MN edges of all ion identities are redirected to their representative nodes so that duplicates replace existing edges if their edge score (cosine similarity) is higher. This limits the MN edges between two nodes to the one with the highest cosine similarity. Furthermore, representative collapsed nodes are extended by multiple attributes, including the intensity of each ion identity and their summed intensity. This enables the direct comparison of ionization tendencies and provides new visualization options. An example with pie charts of the ion abundances is demonstrated in Supplementary Fig. 3.
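The collapsing procedure can be sketched in a few lines of Python. This is a simplified illustration and not the GNPS implementation: the data layout, the function name, and the decision to drop MN edges that fall inside a single collapsed node are assumptions made for this example.

```python
from collections import defaultdict


def collapse_iin(nodes, mn_edges):
    """Collapse ion identity networks into representative nodes.

    nodes: {feature_id: {"iin_id": int or None,
                         "library_score": float or None,
                         "intensity": float}}
    mn_edges: list of (feature_id_1, feature_id_2, cosine)
    Returns (representative per feature, collapsed MN edges, node attributes).
    """
    groups = defaultdict(list)
    for fid, node in nodes.items():
        # features without an IIN ID stay as their own single-member group
        key = ("iin", node["iin_id"]) if node["iin_id"] is not None else ("single", fid)
        groups[key].append(fid)

    rep_of, attributes = {}, {}
    for members in groups.values():
        matched = [f for f in members if nodes[f]["library_score"] is not None]
        if matched:  # prefer the best library match
            rep = max(matched, key=lambda f: nodes[f]["library_score"])
        else:  # otherwise the most abundant feature
            rep = max(members, key=lambda f: nodes[f]["intensity"])
        for f in members:
            rep_of[f] = rep
        attributes[rep] = {
            "ion_intensities": {f: nodes[f]["intensity"] for f in members},
            "summed_intensity": sum(nodes[f]["intensity"] for f in members),
        }

    collapsed = {}
    for a, b, cosine in mn_edges:
        ra, rb = rep_of[a], rep_of[b]
        if ra == rb:
            continue  # assumption: edges inside one collapsed node are dropped
        key = (min(ra, rb), max(ra, rb))
        # keep only the duplicate edge with the highest cosine similarity
        if cosine > collapsed.get(key, -1.0):
            collapsed[key] = cosine
    return rep_of, collapsed, attributes
```

The per-ion intensities stored on each representative node are what would allow pie-chart style visualizations of ionization tendencies.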

Cross-validation of MS2 spectral library matches and MS1 ion identities

In IIMN, nodes may combine annotations from MS2 spectral library matching and MS1 ion identity networking. As cross-validation, GNPS parses and harmonizes the ion species string of both the detected ion identity and matching spectral library entry before checking for equality. The results are reported as an additional column in the node table. This equality check facilitates manual reviewing and the spotting of discrepancies between the MS1 and MS2 annotations.

The ion string parser harmonizes an input (e.g., [M − H2O + 2H]2+) in the following steps:

(1) Spaces are removed
(2) Charge state is detected and removed from the input (2+)
(3) Brackets are removed ([]())
(4) Input is split into added (+2H) and removed (−H2O) parts
(5) Both lists are sorted alphabetically (+2H sorted by letter H)
(6) If the charge state is missing, it is calculated for all parts that are listed in a lookup table (e.g., +Na or +H correspond to charge 1+)
(7) The harmonized string is constructed by concatenation of [M-all removed parts + all added parts]charge state.

As an example, the harmonized string [M + H]+ is produced by the input strings M + H, M + H+, and [M + H]+, which are all commonly found in the GNPS spectral libraries and as an output of various software tools.
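The harmonization steps listed above can be sketched in Python as follows. This is an illustrative reimplementation, not the GNPS parser: the function names, the small charge lookup table, and the rule that a trailing sign without brackets is read as a single charge are assumptions for this example. The output is written without spaces.

```python
import re

# Hypothetical, deliberately small lookup table (the real table is in the GNPS code).
CHARGE_LOOKUP = {"H": 1, "Na": 1, "K": 1, "NH4": 1, "Ca": 2}


def _split_count(part):
    """Split '2H' into (2, 'H'); 'H2O' is returned as (1, 'H2O')."""
    match = re.fullmatch(r"(\d*)(.+)", part)
    return int(match.group(1) or 1), match.group(2)


def _format_charge(charge):
    if not charge:
        return ""  # charge could not be determined
    magnitude = abs(charge)
    return f"{magnitude if magnitude > 1 else ''}{'+' if charge > 0 else '-'}"


def harmonize_ion_string(raw):
    # Step 1: remove spaces (a Unicode minus sign is normalized as well).
    text = raw.replace(" ", "").replace("\u2212", "-")
    # Step 2: detect and remove the charge state.
    charge = None
    if "]" in text:
        text, _, tail = text.rpartition("]")
        match = re.fullmatch(r"(\d*)([+-])", tail)
        if match:
            sign = 1 if match.group(2) == "+" else -1
            charge = sign * int(match.group(1) or 1)
    elif text[-1:] in ("+", "-"):  # assumption: trailing sign means charge 1
        charge = 1 if text[-1] == "+" else -1
        text = text[:-1]
    # Step 3: remove brackets.
    text = re.sub(r"[\[\]()]", "", text)
    # Step 4: split into added and removed parts after the molecule token.
    head = re.match(r"\d*M", text)
    if head is None:
        raise ValueError(f"no molecule token M found in {raw!r}")
    parts = re.findall(r"([+-])([^+-]+)", text[head.end():])
    # Step 5: sort both lists alphabetically, ignoring leading multipliers.
    by_name = lambda part: _split_count(part)[1]
    added = sorted((p for s, p in parts if s == "+"), key=by_name)
    removed = sorted((p for s, p in parts if s == "-"), key=by_name)
    # Step 6: derive a missing charge from the lookup table.
    if charge is None:
        charge = 0
        for sign, group in ((1, added), (-1, removed)):
            for part in group:
                count, name = _split_count(part)
                charge += sign * count * CHARGE_LOOKUP.get(name, 0)
    # Step 7: concatenate removed parts, added parts, and the charge state.
    body = "".join(f"-{p}" for p in removed) + "".join(f"+{p}" for p in added)
    return f"[{head.group()}{body}]{_format_charge(charge)}"
```

With such a function, the cross-validation described above reduces to comparing the harmonized MS1 ion identity with the harmonized library adduct string for equality.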

The full open source code of the ion string parser and its latest charge lookup table can be found on GitHub (https://github.com/CCMS-UCSD/GNPS_Workflows).

IIMN with MZmine

MZmine lacked a functional algorithm to group and annotate different ion species of the same molecules. Therefore, a workflow was implemented and split into separate modules for feature grouping (metaCorrelate), annotation of the most common ions (ion identity networking), an option to add more ion identities to existing IIN iteratively, and modules to validate multimers and in-source fragments based on MS2 scans. Both the creation and expansion of ion identity networks follow customizable lists of adducts and in-source modifications to cover any type of multimers, in-source fragments, and adducts. The IIN procedure lists all possible ion identity pairs between two features and ranks them according to the maximum number of features that support a specific annotation, i.e., the ion identity network size. While a feature might be annotated as two different ion species that point to different metabolites, the current workflow will only create additional IIN edges and ion species metadata for the highest-ranking ion identity per feature. This filter decreases the number of spurious matches. Finally, the GNPS-FBMN export module was modified to export all needed files to run IIMN. The quant table (.csv) contains grouping and ion identity specific columns, and a new Supplementary Pairs text file lists all additional IIN edges. The user can limit the export to features with MS2 fragmentation spectra or include those with an ion identity. Consequently, the IIMN workflow on GNPS connects features without MS2 spectra only by their IIN edges. MZmine is the first tool to provide a direct submission to GNPS to start analysis jobs, consequently streamlining the workflow and lowering the barrier to entry for applying IIMN within GNPS.

In detail, the metaCorrelate feature grouping algorithm searches for features with similar average retention times, chromatographic intensity profiles (feature shapes) with a minimum percentage of intra-sample correlation and overlap, and minimum feature intensity correlations across all samples (Supplementary Fig. 6). The feature shape correlation is a vital filter that significantly reduces false grouping and can apply either a minimum Pearson correlation (favored) or cosine similarity. It requires at least five data points, with two on each side of the peak apex. If a low MS1 scan rate leads to chromatographic peaks with fewer than five data points, it is advisable to either redesign the acquisition method or turn off the feature shape correlation. Note that the latter is expected to reduce the ion annotation consistency and should be used with caution. Similarly, the feature height correlation across all samples is optional, provides the same correlation or similarity measures, and additionally relies on constant ionization conditions for all samples. Therefore, this filter should be turned off if the conditions were changed throughout the study, e.g., by changing the separation conditions or ion source parameters. The general principle of the feature height correlation is that different ions of the same molecule should follow a similar trend in abundance across all samples of the same study. If any feature, such as an [M + H]+ feature, increases at least 10-fold, all grouped features, e.g., [M + Na]+ or [M + NH4]+, should also increase in abundance and should never have a negative feature height correlation coefficient. If both the feature shape and feature height correlation filters are omitted, feature grouping is filtered solely by the retention time window and overlap.

To annotate features on an MS1 level, ion identity libraries are created with a user-defined list of in-source modifications (fragments and clusters), a list of adducts, and a maximum multimers number parameter (Supplementary Fig. 6). Each adduct is combined with each modification to fill the library with ion identities for 1 M to the maximum multimers number. Ion identity networks are then created by applying all ion identity pairs to all pairs of grouped features to calculate and compare the neutral masses of features with specific ion identities (mass difference, charge (z), and multimer number). Optionally, after the creation of ion identity networks with the main library, further ion identities can be added iteratively to existing networks. This workflow enables the user to separate commonly from uncommonly detected ion identities and ensures that each network contains at least two main ion identities. Finally, an ion identity network refinement provides filters for minimum network size and to only keep the largest (most descriptive) IIN per feature.
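The library construction and the neutral mass comparison can be sketched in Python as follows. This is a simplified illustration, not the MZmine code: the names `IonIdentity`, `build_library`, and `matching_pairs` are hypothetical, the mass shifts are approximate monoisotopic values, only a few adducts and one modification are listed, and the tolerance is applied to the neutral mass as the larger of the absolute and ppm window, which is an assumption.

```python
from dataclasses import dataclass
from itertools import product

# Approximate mass shifts in Da (electron mass accounted for) and absolute charge.
ADDUCTS = {
    "[M+H]+": (1.007276, 1),
    "[M+Na]+": (22.989218, 1),
    "[M+NH4]+": (18.033823, 1),
    "[M+2H]2+": (2.014552, 2),
}
# In-source modifications; the empty key means "no modification".
MODIFICATIONS = {"": 0.0, "-H2O": -18.010565}


@dataclass(frozen=True)
class IonIdentity:
    name: str
    multimer: int      # number of molecules M in the ion
    mass_shift: float  # adduct plus modification mass in Da
    charge: int        # absolute charge z

    def neutral_mass(self, mz):
        """Neutral mass of one molecule M implied by an observed m/z."""
        return (mz * self.charge - self.mass_shift) / self.multimer


def build_library(max_multimer):
    """Combine every adduct with every modification for 1M..max_multimer."""
    library = []
    for n, (adduct, (a_mass, z)), (mod, m_mass) in product(
        range(1, max_multimer + 1), ADDUCTS.items(), MODIFICATIONS.items()
    ):
        prefix = "" if n == 1 else str(n)
        name = adduct.replace("[M", f"[{prefix}M{mod}", 1)
        library.append(IonIdentity(name, n, a_mass + m_mass, z))
    return library


def matching_pairs(mz_a, mz_b, library, tol_da=0.001, tol_ppm=10.0):
    """Ion identity pairs for which both features imply the same neutral mass."""
    hits = []
    for ion_a, ion_b in product(library, repeat=2):
        if ion_a == ion_b:
            continue
        mass_a, mass_b = ion_a.neutral_mass(mz_a), ion_b.neutral_mass(mz_b)
        tolerance = max(tol_da, mass_a * tol_ppm * 1e-6)
        if abs(mass_a - mass_b) <= tolerance:
            hits.append((ion_a.name, ion_b.name))
    return hits
```

For two grouped features whose m/z values differ by the H/Na mass difference, `matching_pairs` returns the pair ([M+H]+, [M+Na]+) but also pairs that carry the same modification on both ions, such as ([M-H2O+H]+, [M-H2O+Na]+). A single mass difference is therefore ambiguous, which is why the ranking by network size and the refinement step matter.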

More on the integration of the new IIMN workflow in MZmine can be found online (http://mzmine.github.io/iin_fbmn).

Refer to the documentation and video tutorials on how to apply IIMN within MZmine and GNPS. The YouTube playlist “MZmine: Ion Identity Molecular Networking” contains instructions on data processing for IIMN and FBMN, a minimalistic and a full IIMN workflow within MZmine, and theoretical background on feature shape correlation and ion identity molecular networking.

https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin-mzmine/

https://www.youtube.com/playlist?list=PL4L2Xw5k8ITyxSyBdrcv70LDKsP8QNuyN

IIMN with XCMS (CAMERA)

The XCMS16 Bioconductor package32 is the most widely used software for processing untargeted LC-MS-based metabolomics data. Its results can be further processed with the CAMERA13 package to determine which of the extracted m/z-rt features might be adducts13 or isotopes33 of the same original compound. For the integration of XCMS and CAMERA into the IIMN workflow, utility functions were created (‘getFeatureAnnotations’ and ‘getEdgelist’) to extract and export MS1-based feature and edge annotations (i.e., grouping of features to adduct/isotope groups of the same compound). In addition, the utility function ‘formatSpectraForGNPS’ is used to export MS2 spectra. These functions are available in the GitHub repository https://github.com/jorainer/xcms-gnps-tools. R-markdown documents and Python scripts with example analyses and descriptions are available in the documentation (https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin-xcms/). The files exported by these utility functions can be used directly for IIMN analysis on GNPS. Note that, in principle, it is possible to use RAMClust10, CliqueMS14, or other packages available for XCMS that perform ion annotation. The results of these packages need to be reformatted to the introduced generic supplementary edges format. The CAMERA integration might serve as a reference and starting point.

IIMN with MS-DIAL

MS-DIAL34 is a versatile mass spectrometry data processing software capable of processing various nontargeted LC-MS metabolomics experiments, including ion mobility mass spectrometry (http://prime.psc.riken.jp/compms/msdial/main.html). MS-DIAL has supported IIMN since version 4.1. After a standard data processing workflow with MS-DIAL, the alignment results can be exported for IIMN analysis using the GNPS export option. Detailed documentation and representative tutorials are available in the GNPS documentation (https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin-msdial).

Dataset processing

All 24 datasets (Supplementary Table 1) were processed with the MZmine workflow. As each dataset originates from a different study and was acquired with different LC-MS methods, variable feature detection and alignment parameters were applied, which are summarized in Supplementary Table 4. For all datasets, the same parameters were used for the feature grouping module (metaCorrelate) and the ion identity networking modules, with the only exception that the feature height correlation filter was turned off to group features for the post-column salt infusion experiments. As described previously, this filter should only be applied if the ionization conditions and detection sensitivity are kept constant over all samples. The post-column infusion of different salt solutions for this study promotes the formation of specific ion species in the ionization source.

  1. A pair of features was grouped with a retention time tolerance of 0.1 min, with a minimum overlapping intensity percentage of 50% in at least two samples in the whole dataset (gap-filled features excluded), a feature shape Pearson correlation of at least 0.85 with at least five data points and two data points on each edge, and a feature height Pearson correlation of at least 0.6 with at least three data points.

  2. The initial creation of ion identity networks was performed using the ion identity networking module and a maximum tolerance of 0.001 m/z or 10 ppm, a comparison where a pair of features and a pair of ion identities only need to match in one sample, and an ion identity library created based on 2 M as the maximum multimers number, a list of adducts ([M + H]+, [M + Na]+, [M + NH4]+, [M − H + 2Na]+, [M + 2H]2+, and [M + H + Na]2+), and a list of in-source modifications ([M − H2O] and [M − 2H2O]).

  3. Two iterations were applied to add more ion identities to the resulting networks of step 2 with an unchanged m/z tolerance.

    (a) To add a higher variety of adducts, a maximum multimers number of 2, a list of adducts ([M + H]+, [M + Na]+, [M + K]+, [M + NH4]+, [M − H + 2Na]+, [M − H + Ca]+, [M − H + Fe]+, [M + 2H]2+, [M + H + Na]2+, [M + H + NH4]2+, [M + Ca]2+, and [M + Fe]2+), and an empty list of modifications were used.

    (b) To add a greater variety of modifications and larger multimers, a maximum multimers number of 5, a list of adducts ([M + H]+, [M + NH4]+, and [M + 2H]2+), and a list of modifications ([M − H2O], [M − 2H2O], [M − 3H2O], [M − 4H2O], [M − HFA], and [M − ACN]) were used.
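The feature shape criterion of step 1 can be illustrated with a short Python sketch. It assumes that both features were sampled on the same MS1 scans and are already restricted to their overlapping retention time range, and it takes the apex from the first feature. These simplifications and the function names `pearson` and `shapes_correlate` are assumptions, not the metaCorrelate implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / math.sqrt(var_x * var_y)


def shapes_correlate(intensities_a, intensities_b,
                     min_r=0.85, min_points=5, min_points_per_edge=2):
    """Check the feature shape criterion for two features in one sample."""
    if len(intensities_a) != len(intensities_b):
        raise ValueError("profiles must cover the same scans")
    if len(intensities_a) < min_points:
        return False
    # Index of the apex of feature A; points left and right of it form the edges.
    apex = max(range(len(intensities_a)), key=intensities_a.__getitem__)
    left, right = apex, len(intensities_a) - apex - 1
    if min(left, right) < min_points_per_edge:
        return False
    return pearson(intensities_a, intensities_b) >= min_r
```

The defaults mirror the values listed in step 1 (Pearson correlation of at least 0.85, five data points, two per edge). The feature height correlation across samples could reuse `pearson` on per-sample feature heights with a threshold of 0.6.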

Dataset statistics

Ion identity molecular networking statistics on all datasets were extracted with a new MZmine module and exported to a comma-separated file (csv) for evaluation in Microsoft Excel. The module is included in the special IIMN build of MZmine. All available statistics were based on the spectral input file (mgf) and the resulting network file (graphml), which was downloaded from the dataset’s corresponding GNPS results page. The graphml file contains all ion identity molecular networking results, namely, the nodes representing individual features and the edges between nodes. The mgf spectral summary file contains the corresponding MS2 spectrum for each feature node. While classical MN and FBMN depend on MS2 data for each node, IIN creates new MS1-based edges that might include nodes without an MS2 spectrum in the resulting network. For a comparison between FBMN and IIMN, only nodes present within both networks (with an MS2 spectrum) are considered. A statistical summary and in-depth statistics on each dataset are provided in a supplementary Microsoft Excel workbook (Supplementary Data 1). Excerpts are summarized in Supplementary Table 2, and the different statistical measures and metadata items are described in Supplementary Table 5. One important measure is the identification density, i.e., the proportion of nodes that are either identified or within a maximum distance of n edges of at least one identified compound. Supplementary Figure 5 highlights how the additional edges of ion identity networking increase the identification density in the datasets, measured over a maximum distance of 1–5 edges. The increased density over one edge reflects new IIN edges that link unidentified nodes to an identified node. The identification density is increased for 21 datasets, two datasets with poor identification rates exhibit no change, and one dataset lacks identifications.

The maximum identification density increase of +8% over one edge results in a total of 42% of the nodes being either identified or directly linked to an identified compound. The network of the corresponding dataset, i.e., the post-column salt infusion study, contains a total of 22% identified nodes and 25% nodes with ion identity and MS2 spectrum in 134 ion identity networks. Ion identity molecular networking decreased the number of unconnected singleton nodes by 12% to a total of 42%. Filtering out nodes with poor MS2 spectra with fewer than four signals, which was used as the minimum number of signals for the library matching and FBMN networking, decreases the number of unconnected singleton nodes further to 29%. Consequently, the network contains many nodes without a match to any library or experimental spectra. Collapsing all nodes with IIN edges into molecular nodes reduces the total network size by 20%, which significantly reduces the overall redundancy and facilitates network visualization and analysis.
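As an illustration of the identification density, the following Python sketch computes the fraction of nodes that are identified or lie within a maximum number of edges of an identified node, using a breadth-first search started from all identified nodes at once. The function name and the plain edge-list input are assumptions; the MZmine statistics module reads the graphml file instead.

```python
from collections import deque


def identification_density(all_nodes, edges, identified, max_distance):
    """Fraction of nodes that are identified or within max_distance edges
    of at least one identified node (undirected network)."""
    adjacency = {node: set() for node in all_nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    if not adjacency:
        return 0.0
    # Multi-source breadth-first search starting at every identified node.
    distance = {node: 0 for node in identified}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the distance limit
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return len(distance) / len(adjacency)
```

Calling the function once with the FBMN edges only and once with the FBMN and IIN edges combined, for the same node set, gives the density change attributable to the additional IIN edges.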

To extract the same statistics on any results from IIMN, download the networking results as a graphml file from a GNPS job page and use the mgf file of that analysis. The special MZmine IIMN build offers two modules in the Tools tab. More information and the latest IIMN-enabled MZmine version are available online (http://mzmine.github.io/iin_fbmn).

  • GNPS results analysis (IIMN + FBMN)

    • For a single analysis

    • This tool also offers the extraction of new spectral library entries

  • GNPS results analysis (IIMN + FBMN) of all sub

    • For multiple analyses at once

    • Generates statistics for each subfolder with exactly one graphml and mgf file (names do not have to match)

IIMN-based spectral library generation

From experimental datasets

To comprehensively cover the fragmentation behavior of a molecule, spectral libraries should contain fragmentation spectra of different ion species acquired with different instrument types and fragmentation methods. IIMN might serve as a solution to expand spectral libraries. To create new spectral library entries based on IIMN, all 24 datasets were searched for ion identity networks that contain a match to the GNPS spectral libraries with a minimum cosine similarity of 0.9 and a minimum number of shared fragment ions of 4–6, depending on each dataset’s FBMN parameters. For each matching IIN, all contained ion identity features with an MS2 spectrum and at least three signals above 0.1% relative intensity were extracted as new library spectra. The new library entries were constructed based on the highest library match and its attributes, namely, the compound name, structure strings as SMILES and InChI, and the neutral mass; the ion identity provided the ion species information and the precursor m/z; and dataset-specific metadata was added manually. With these strict rules, a total of 538 spectral entries were extracted from all 24 datasets. The new library has a broader and more distributed ion identity coverage when compared to selected representative spectral libraries from MassBank of North America (MoNA) and GNPS. At the same time, it is similar to spectral libraries that were generated with the new MSMS-Chooser library creation workflow in the GNPS ecosystem (Supplementary Fig. 5). The new IIMN-based library was made publicly available through the GNPS-library batch submission (Supplementary Table 3, https://gnps.ucsd.edu/ProteoSAFe/gnpslibrary.jsp?library=GNPS-IIMN-PROPOGATED).
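The extraction rules above can be summarized in a small Python sketch. The function names and the input format (a list of fragment intensities per spectrum) are hypothetical, and the default thresholds are the values stated in the text; the minimum number of shared fragment ions varied between 4 and 6 across datasets.

```python
def count_signals(intensities, min_relative_intensity=0.001):
    """Number of signals above a relative intensity threshold (0.001 = 0.1%)."""
    if not intensities:
        return 0
    base_peak = max(intensities)
    if base_peak <= 0:
        return 0
    return sum(1 for i in intensities if i / base_peak > min_relative_intensity)


def network_qualifies(cosine, shared_fragments, min_cosine=0.9, min_shared=4):
    """Does the best library match of an IIN meet the matching thresholds?"""
    return cosine >= min_cosine and shared_fragments >= min_shared


def is_library_candidate(intensities, min_signals=3, min_relative_intensity=0.001):
    """Does an MS2 spectrum of a matched IIN qualify as a new library entry?"""
    return count_signals(intensities, min_relative_intensity) >= min_signals
```

A spectrum is only considered when its ion identity network passes `network_qualifies`; `is_library_candidate` then removes sparse spectra.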

From a natural product compound library

The library creation workflow was repeated and refined on the mass spectrometry data collected for the NIH NPAC ACONN collection of natural products (2179 compounds) provided by Ajit Jadhav (NIH, NCATS). The IIMN workflow was optimized and then applied to two LC-MS datasets collected on mass spectrometers operating in positive ionization mode, the MSV000080492 acquired on a qTOF-MS maXis II (Bruker Daltonics, GmbH) and the MSV000083472 acquired on a Q-Exactive (ThermoFisher Scientific, MA). During feature-based molecular networking, library matching was limited to the manually created GNPS libraries, which were based on the same qTOF-MS dataset (GNPS-NIH-NATURALPRODUCTSLIBRARY_ROUND2_POSITIVE, minimum matched signals = 3, minimum cosine similarity = 0.6). A new library for both datasets was created with new spectral entries with at least two signals above 0.1% relative intensity and with ion identities matching to the adduct of the library matches. Furthermore, library matches were filtered by a sample list of compound names contained in LC-MS samples. The IIMN library creation workflow resulted in 805 and 1314 new library entries for the qTOF-MS and the Q-Exactive datasets, respectively. The new IIMN-based library entries were made publicly available through the GNPS-library batch submission and merged into the existing manually created library GNPS-NIH-NATURALPRODUCTSLIBRARY_ROUND2_POSITIVE (Supplementary Table 3). In total, we generated 2,657 IIMN-based new spectral library entries.

MZmine IIMN workflow for spectral library extraction

To extract spectral library entries from any IIMN results, download the networking results as a graphml file from a GNPS job page and use the mgf file of that analysis. The special MZmine IIMN build offers the “GNPS results analysis” module in the Tools tab to create library entries based on these two files and provided metadata. The minimum GNPS-library match score sets a threshold for the extraction of library entries. Furthermore, library matches can be filtered to also match the ion identity to the adduct of the library match. A simple comparison between the different reporting formats for adducts was implemented. It removes all spaces, square brackets, and plus symbols (e.g., harmonizing M + H and [M + H]+). Filters are available for new library entries with a minimum number of signals above a relative intensity threshold.
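The simple adduct comparison described above can be sketched in Python as follows. The function names are hypothetical, and the sketch follows the description literally by removing only spaces, square brackets, and plus symbols.

```python
def simplify_adduct(adduct):
    """Remove all spaces, square brackets, and plus symbols from an adduct string."""
    return "".join(ch for ch in adduct if ch not in " []+")


def adducts_match(ion_identity, library_adduct):
    """Compare two adduct strings irrespective of their reporting format."""
    return simplify_adduct(ion_identity) == simplify_adduct(library_adduct)
```

With this reduction, M + H and [M + H]+ both become MH and are treated as the same adduct, whereas [M + H]+ and [M + Na]+ remain distinct. Because minus symbols are kept under this literal reading, a negative-mode string written without its trailing charge would not match its bracketed form.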

The latest information on the IIMN MS2 library generation workflow in MZmine is available online:

http://mzmine.github.io/iin_fbmn

Documentation on the GNPS-library batch submission is available at:

https://ccms-ucsd.github.io/GNPSDocumentation/batchupload/

Documentation

The documentation of the IIMN workflow is shared in the GNPS documentation on GitHub (https://ccms-ucsd.github.io/GNPSDocumentation/fbmn-iin/), which also covers FBMN, classical MN, and other related tools. Suggested parameters for FBMN are described elsewhere7, and the reproducible molecular networking protocol5 describes MN parameters with step-by-step instructions. MZmine17 provides help dialogs with parameter descriptions for each module and documentation links on its website (http://mzmine.github.io/documentation.html). Tutorials and other references for MS-DIAL18 are provided on its project website (http://prime.psc.riken.jp/compms/msdial/main.html). Bioconductor hosts the XCMS16 and CAMERA13 packages together with related information and their documentation (https://bioconductor.org/).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.