AI / ML and Data

IMS and GC-IMS generate high-dimensional spectral and chromatographic data that cannot be reliably interpreted using simple threshold logic.

Artificial intelligence constitutes a core enabling technology of TeChBioT, transforming raw spectral signals into robust, real-time classification outputs under laboratory and field conditions.

Core Analytical Challenges

The AI/ML framework addresses three fundamental challenges:

AI/ML for Chemical Agent Detection

Chemical detection within TeChBioT focuses on volatile and semi-volatile chemical warfare agent (CWA) simulants and selected real CWAs measured under controlled laboratory conditions.

Depending on the operational scenario, the detection platform operates using:

Standalone HT-IMS
Hyphenated HT-GC-IMS configurations

These systems produce multidimensional datasets including:

Retention time
Drift time (or inverse reduced mobility 1/K₀)
Intensity distributions
Dual-polarity ion information

In realistic environments such as gasoline vapour backgrounds or variable humidity conditions, spectral overlap and matrix effects significantly complicate interpretation.

AI-based pattern recognition was implemented to improve:

Sensitivity
Selectivity
Decision reliability

Preprocessing Pipeline

Before model training, chromatograms undergo systematic preprocessing.

Baseline drift and baseband noise are removed using wavelet transform techniques.
High-frequency noise is attenuated by wavelet shrinkage or Savitzky–Golay smoothing while preserving peak morphology.
Persistent homology methods are employed for robust peak detection, applied locally within automatically defined regions of interest to reduce computational load and avoid global threshold artefacts.
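
To make the persistence-based peak detection concrete, the following is a minimal sketch of zero-dimensional persistent homology on a single 1D intensity trace. It is an illustration under stated assumptions, not the TeChBioT implementation: the function name `persistent_peaks` and the parameter `min_persistence` are hypothetical, and the region-of-interest logic mentioned above is omitted. Each local maximum is "born" at its own height and "dies" when its region merges with that of a higher peak; the difference is its persistence, so no global intensity threshold is needed.

```python
import numpy as np

def persistent_peaks(y, min_persistence):
    """Return (index, persistence) of peaks in a 1D signal (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n == 0:
        return []
    label = np.full(n, -1)   # label[i]: peak index of the component holding sample i
    parent = {}              # union-find over peak indices
    persistence = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Sweep from the highest sample to the lowest (superlevel-set filtration).
    for i in map(int, np.argsort(-y, kind="stable")):
        neighbours = {find(int(label[j])) for j in (i - 1, i + 1)
                      if 0 <= j < n and label[j] >= 0}
        if not neighbours:
            parent[i] = i                  # a new peak is born
            label[i] = i
        elif len(neighbours) == 1:
            label[i] = neighbours.pop()    # sample joins an existing peak
        else:
            a, b = sorted(neighbours, key=lambda p: y[p], reverse=True)
            persistence[b] = y[b] - y[i]   # the lower peak dies at this saddle
            parent[b] = a
            label[i] = a

    top = find(int(label[0]))              # global maximum never dies
    persistence[top] = y[top] - y.min()
    return sorted((int(p), float(v)) for p, v in persistence.items()
                  if v >= min_persistence)

signal = [0, 2, 1, 5, 0.5, 3, 0]
print(persistent_peaks(signal, min_persistence=2.0))  # [(3, 5.0), (5, 2.5)]
```

In this toy trace the small shoulder at index 1 has a persistence of only 1.0 and is discarded, while the two prominent peaks are retained.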

For GC-MS datasets:

Retention times are normalized using the Kovats retention index to mitigate variations arising from different column lengths and temperature programs.
Signals are scaled and standardized to ensure numerical stability and balanced feature weighting during training.
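
As an illustration of these two steps, the sketch below computes a retention index from bracketing n-alkane retention times and standardizes a feature matrix. It assumes the linear (van den Dool and Kratz) form of the index, I = 100 · [n + (t − t_n) / (t_{n+1} − t_n)], which is the variant commonly used with temperature-programmed runs; the source does not state which form was applied. The function names and the alkane retention times are hypothetical.

```python
import numpy as np

def linear_retention_index(t, alkane_times):
    """Retention index of a peak at time t.

    alkane_times maps carbon number -> retention time of that n-alkane
    (hypothetical calibration data measured on the same column and program).
    """
    carbons = sorted(alkane_times)
    for n, n_next in zip(carbons, carbons[1:]):
        t_n, t_next = alkane_times[n], alkane_times[n_next]
        if t_n <= t <= t_next:
            return 100.0 * (n + (n_next - n) * (t - t_n) / (t_next - t_n))
    raise ValueError("retention time lies outside the alkane calibration range")

def standardize(X):
    """Zero-mean, unit-variance scaling per feature (column)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # leave constant features unchanged
    return (X - X.mean(axis=0)) / std

alkanes = {6: 2.0, 7: 4.0, 8: 8.0}           # minutes, illustrative values only
print(linear_retention_index(3.0, alkanes))  # 650.0
```

Because the index is expressed relative to the alkane ladder rather than in absolute time, the same compound maps to a similar value on columns of different length.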

Supervised Learning Approaches

Multiple supervised learning algorithms were evaluated for chemical classification:

Support Vector Machines (SVM)
Logistic Regression
XGBoost
Multilayer Perceptrons (MLP)
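
A comparison of this kind could be set up as in the following sketch, which uses scikit-learn and five synthetic, well-separated classes as a stand-in for the simulant data. The dataset, hyperparameters, and cross-validation scheme are assumptions for illustration only. XGBoost is left out because it requires the separate `xgboost` package.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder: 300 feature vectors from 5 classes (not real IMS data).
X, y = make_blobs(n_samples=300, centers=5, n_features=20,
                  cluster_std=2.0, random_state=0)

models = {
    # class_weight="balanced" is one common way to handle class imbalance.
    "SVM": SVC(kernel="rbf", class_weight="balanced"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}

scores = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so each fold is standardized
    # using its own training split only.
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: mean 5-fold accuracy = {score:.3f}")
```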

Two modelling strategies were assessed:

Holistic strategy
Performs end-to-end classification directly from chromatographic input.

Modular strategy
Embeds ML into specific pipeline stages such as noise filtering or peak selection.

Laboratory and Field Performance

In laboratory experiments involving five CWA simulants:

Classification accuracy was consistently high across models.
SVM demonstrated particular robustness under class imbalance.

In outdoor validation campaigns involving the simulant DPM:

AI models produced rapid alarm decisions in real time.
No false positives or false negatives were observed.

The user interface displayed clear red (alarm) and green (safe) signals, demonstrating the feasibility of automated decision support in mobile deployment scenarios.
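
The mapping from a model output to the red/green display, and the tallying of false positives and false negatives, can be sketched as follows. The probability threshold of 0.5 and both function names are hypothetical; the source does not describe the actual decision rule.

```python
def alarm_state(p_agent, threshold=0.5):
    """Map a predicted agent probability to a display colour (assumed rule)."""
    return "red" if p_agent >= threshold else "green"

def count_errors(states, agent_present):
    """Count false positives (red without agent) and false negatives (green with agent)."""
    false_pos = sum(s == "red" and not present
                    for s, present in zip(states, agent_present))
    false_neg = sum(s == "green" and present
                    for s, present in zip(states, agent_present))
    return false_pos, false_neg

probabilities = [0.9, 0.1, 0.7, 0.2]       # illustrative model outputs
truth = [True, False, True, False]         # whether the simulant was released
states = [alarm_state(p) for p in probabilities]
print(states, count_errors(states, truth))
```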

AI/ML for Biological Agent Detection

Biological detection presents fundamentally different analytical constraints.

Bacteria and viruses are non-volatile entities and require fragmentation prior to analysis.

Compared to chemical detection, biological classification must address:

Greater inter-class similarity
Lower signal-to-noise ratios
Higher variability across environmental matrices

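Benchmarking was performed using two reference techniques:

MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry)
Py-GC-MS (pyrolysis gas chromatography–mass spectrometry)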

Evaluated ML and DL Models

A broad range of ML and DL models was evaluated to assess their ability to capture biologically meaningful patterns.

Classical ML approaches:

Random Forest
Support Vector Machines
Ridge Classifiers
k-Nearest Neighbors
XGBoost
Partial Least Squares Discriminant Analysis (PLS-DA)

Deep learning architectures:

One-dimensional and two-dimensional Convolutional Neural Networks (CNN1D and CNN2D)
Fully Connected Neural Networks (FCNN)
Denoising autoencoders
Established computer vision backbones such as ResNet and VGG

MALDI-TOF Preprocessing and Results

For MALDI-TOF spectra, preprocessing included:

Asymmetric least squares baseline subtraction
Savitzky–Golay smoothing
Normalization
Truncation to the 2,000–12,000 Da m/z range to remove matrix noise and low-informative regions
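
The four steps can be chained as in the sketch below, applied to a synthetic spectrum. Several details are assumptions not given in the source: the asymmetric least squares parameters (`lam`, `p`), the Savitzky–Golay window and polynomial order, the choice of total-ion-current normalization, and the order of execution, which simply follows the list above.

```python
import numpy as np
from scipy import sparse
from scipy.signal import savgol_filter
from scipy.sparse.linalg import spsolve

def als_baseline(y, lam=1e6, p=0.01, n_iter=10):
    """Asymmetric least squares baseline estimate (Eilers-style iteration)."""
    n = len(y)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)            # penalizes curvature of the baseline
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + penalty).tocsc(), w * y)
        w = np.where(y > z, p, 1.0 - p)  # points above the baseline get low weight
    return z

def preprocess_maldi(mz, intensity, mz_min=2000.0, mz_max=12000.0):
    y = intensity - als_baseline(intensity)              # baseline subtraction
    y = savgol_filter(y, window_length=21, polyorder=3)  # smoothing
    y = np.clip(y, 0.0, None)
    y = y / y.sum()                                      # assumed: TIC normalization
    keep = (mz >= mz_min) & (mz <= mz_max)               # truncate m/z range
    return mz[keep], y[keep]

# Synthetic spectrum: decaying baseline, three peaks, white noise.
rng = np.random.default_rng(0)
mz = np.arange(1000.0, 15001.0, 2.0)
spectrum = 50.0 * np.exp(-mz / 3000.0) + rng.normal(0.0, 0.5, mz.size)
for centre, height in [(4000.0, 60.0), (6000.0, 100.0), (9000.0, 60.0)]:
    spectrum += height * np.exp(-0.5 * ((mz - centre) / 20.0) ** 2)

mz_out, y_out = preprocess_maldi(mz, spectrum)
print(mz_out.min(), mz_out.max(), mz_out[np.argmax(y_out)])
```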

On internal datasets, the models achieved perfect classification performance for:

Discrimination between bacteria and viruses
Gram-positive versus Gram-negative bacteria
A panel of seven bacterial and five viral species

External validation using an independent reference database from the Robert Koch Institute confirmed robust generalization.

The Extra Trees Classifier achieved:

100% accuracy for Gram classification
Approximately 80% accuracy for multi-class species identification

demonstrating strong transferability beyond the training dataset.
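
The external-validation protocol, training on one dataset and scoring on an independent one without refitting, can be sketched with scikit-learn's `ExtraTreesClassifier`. The data here are synthetic: `make_spectra` is a hypothetical generator, and the noise level and offset of the "external" set are arbitrary choices meant only to mimic a shift between laboratories.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_spectra(n_per_class, noise_sd, offset):
    """Hypothetical stand-in for preprocessed spectra: 2 classes, 50 features."""
    y = np.repeat([0, 1], n_per_class)
    class_means = np.zeros((2, 50))
    class_means[1, :10] = 3.0            # classes differ in the first 10 features
    noise = rng.normal(0.0, noise_sd, size=(2 * n_per_class, 50))
    return class_means[y] + offset + noise, y

X_internal, y_internal = make_spectra(100, noise_sd=1.0, offset=0.0)
X_external, y_external = make_spectra(60, noise_sd=1.5, offset=0.3)  # shifted domain

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(X_internal, y_internal)          # fitted on internal data only

external_accuracy = accuracy_score(y_external, clf.predict(X_external))
print(f"external accuracy: {external_accuracy:.3f}")
```

Keeping the external set entirely out of training and model selection is what allows its score to be read as a measure of transferability.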

Py-GC-MS Analysis

A comprehensive Py-GC-MS dataset comprising 22 bacterial and viral classes was analysed using multiple data representations, including:

Full 2D GC×MS chromatograms
Total ion current (TIC) profiles
Fatty acid methyl ester (FAME) features
Principal component features

Deep learning applied to the 2D GC×MS representation achieved the highest classification performance, highlighting the value of preserving spatial structure in chromatographic–mass spectral data.
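
The relationship between these representations can be illustrated with NumPy: the TIC profile collapses the m/z axis of each 2D map, and principal component features are projections of the flattened maps. The array sizes and random values below are placeholders, not project data, and FAME features are omitted because they require compound-level peak annotation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_scans, n_mz = 12, 40, 30
# Placeholder stack of 2D maps: (sample, retention-time scan, m/z channel).
maps = rng.random((n_samples, n_scans, n_mz))

# TIC profile: sum over the m/z axis, one intensity per scan.
tic = maps.sum(axis=2)                        # shape (12, 40)

# Principal component features via SVD of the mean-centred, flattened maps.
flat = maps.reshape(n_samples, -1)            # shape (12, 1200)
centred = flat - flat.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
k = 3
pc_features = centred @ Vt[:k].T              # shape (12, 3)
explained = S[:k] ** 2 / np.sum(S ** 2)       # variance fraction per component

print(tic.shape, pc_features.shape, explained.round(3))
```

Both the TIC and the principal component features discard the joint retention-time and m/z layout, whereas a 2D model receives `maps` directly and can exploit it.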

Py-GC-IMS Preprocessing Pipeline

For Py-GC-IMS datasets, a structured preprocessing pipeline standardizes the chromatograms prior to modelling.

Processing steps include:

Two-dimensional interpolation to ensure fixed spatial resolution
Region-of-interest restriction based on retention time and inverse reduced mobility ranges
Savitzky–Golay smoothing to attenuate high-frequency noise while preserving peak structure
Reactant Ion Peak (RIP) identification and removal to avoid dominance of non-analyte features
Baseline correction via white top-hat morphological filtering
Intensity thresholding
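
The sequence above can be sketched with SciPy on a synthetic chromatogram. All parameter values, the function name `preprocess_gc_ims`, and the RIP heuristic (the drift column with the highest median intensity over retention time, zeroed within a fixed half-width) are assumptions for illustration; the source does not specify how each step is configured.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.ndimage import white_tophat
from scipy.signal import savgol_filter

def preprocess_gc_ims(img, rt, inv_k0, rt_roi, k0_roi, shape=(128, 128),
                      rip_halfwidth=0.08, tophat_size=21, rel_threshold=0.1):
    # Interpolation and ROI restriction: resample onto a fixed grid spanning the ROI.
    rt_new = np.linspace(rt_roi[0], rt_roi[1], shape[0])
    k0_new = np.linspace(k0_roi[0], k0_roi[1], shape[1])
    interp = RegularGridInterpolator((rt, inv_k0), img)   # linear by default
    grid = np.stack(np.meshgrid(rt_new, k0_new, indexing="ij"), axis=-1)
    out = interp(grid)
    # Savitzky-Golay smoothing along the drift (mobility) axis.
    out = savgol_filter(out, window_length=5, polyorder=2, axis=1)
    # RIP removal: the RIP is present at all retention times, so it dominates
    # the per-column median; zero a band around it.
    rip_col = np.argmax(np.median(out, axis=0))
    out[:, np.abs(k0_new - k0_new[rip_col]) <= rip_halfwidth] = 0.0
    # Baseline correction: white top-hat removes structures wider than the element.
    out = white_tophat(out, size=(tophat_size, tophat_size))
    # Intensity thresholding relative to the strongest remaining signal.
    out[out < rel_threshold * out.max()] = 0.0
    return out, rt_new, k0_new

# Synthetic data: flat baseline, RIP at 1/K0 = 0.55, one analyte peak, noise.
rng = np.random.default_rng(0)
rt = np.linspace(0.0, 100.0, 120)
inv_k0 = np.linspace(0.4, 2.0, 300)
RT, K0 = np.meshgrid(rt, inv_k0, indexing="ij")
img = (0.2
       + 5.0 * np.exp(-0.5 * ((K0 - 0.55) / 0.02) ** 2)
       + 3.0 * np.exp(-0.5 * (((RT - 50.0) / 3.0) ** 2 + ((K0 - 1.2) / 0.03) ** 2))
       + rng.normal(0.0, 0.02, RT.shape))

clean, rt_axis, k0_axis = preprocess_gc_ims(img, rt, inv_k0, (10.0, 90.0), (0.5, 1.8))
i, j = np.unravel_index(np.argmax(clean), clean.shape)
print(clean.shape, rt_axis[i], k0_axis[j])
```

After these steps every chromatogram has the same shape and the analyte peak, rather than the RIP, carries the largest intensity.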

Py-GC-IMS Classification Performance

For laboratory-generated Py-GC-IMS datasets:

CNN-based deep learning models achieved superior performance compared to classical ML approaches
CNN1D reached perfect classification accuracy under positive-polarity measurements
Dual-polarity configurations were preferred for robustness across conditions

During outdoor validation campaigns:

FCNN, CNN2D and denoising autoencoder architectures achieved accuracies above 85% despite domain shift between laboratory and field environments
Only a single misclassification was observed in the evaluated dataset

Data Simulation and Future Scalability

Because access to hazardous compounds and large annotated datasets is inherently limited, TeChBioT developed a chromatogram data simulator capable of generating synthetic chromatograms with:

Variable peak shapes
Baseline drift
White noise
Column degradation artefacts

This simulator:

Enhances model robustness
Enables stress testing under diverse synthetic scenarios
Supports future transfer learning strategies
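
A simulator of this kind can be approximated in a few lines. The sketch below is not the TeChBioT simulator: the function `simulate_chromatogram`, its parameters, the asymmetric-Gaussian peak model, the linear drift, and the modelling of column degradation as peak broadening at constant area are all assumptions chosen for brevity.

```python
import numpy as np

def simulate_chromatogram(t, peaks, drift_slope=0.0, noise_sd=0.0,
                          degradation=0.0, seed=None):
    """Synthetic 1D chromatogram (hypothetical model).

    peaks: iterable of (centre, height, sigma, tailing); tailing > 1 widens
    the trailing edge. degradation >= 0 broadens every peak and lowers its
    height so that the area stays roughly constant.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros_like(t, dtype=float)
    for centre, height, sigma, tailing in peaks:
        s = sigma * (1.0 + degradation)
        h = height / (1.0 + degradation)
        width = np.where(t < centre, s, s * tailing)   # asymmetric Gaussian
        y += h * np.exp(-0.5 * ((t - centre) / width) ** 2)
    y += drift_slope * (t - t[0])                      # linear baseline drift
    y += rng.normal(0.0, noise_sd, size=t.shape)       # white noise
    return y

t = np.linspace(0.0, 60.0, 1201)
peaks = [(20.0, 1.0, 0.5, 1.5), (40.0, 0.6, 0.8, 1.0)]
fresh = simulate_chromatogram(t, peaks)
aged = simulate_chromatogram(t, peaks, drift_slope=0.002, noise_sd=0.02,
                             degradation=0.5, seed=1)
print(fresh.max(), aged.max())
```

Varying the peak list, drift, noise level, and degradation factor yields labelled training or stress-test examples without handling hazardous material.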

From Sensor to Intelligent Decision Support

AI integration transforms TeChBioT from a signal-producing analytical device into an intelligent decision-support platform.

By reducing false alarms, compensating for environmental variability, and enabling hierarchical biological classification, AI and deep learning significantly enhance operational reliability.

The combination of HT-GC-IMS technology and advanced AI provides a scalable foundation for future deployment in:

Mobile platforms
UAV-based systems
Networked CBRN monitoring systems

AI/ML is not an auxiliary component but a central innovation pillar of the TeChBioT architecture.