Abstract—The preprocessing of mass spectrometry (MS) data
is a crucial step in every MS study, which not only makes data
comparable and manageable but also makes the study more
reproducible. However, an essential part of this process, which
is often overlooked, is peak matching. Although existing
clustering methods have been applied to peak matching, their
use has been limited. For example, the use of
hierarchical agglomerative clustering (HAC) for matching of
mass/charge signals has been constrained to small-scale MS
data sets due to the computational complexity of HAC. In this
paper, we reintroduce bi-directional hierarchical
agglomerative clustering (BHC) as a scalable and accurate
peak matching technique. BHC reduces the computational
complexity of hierarchical agglomerative clustering for peak
matching to O(R log R). We benchmark BHC against
existing peak matching techniques.
Finally, we propose a parallelization framework that
significantly reduces the peak matching method’s computation
time.
Index Terms—Mass spectrometry data preprocessing, peak
matching, hierarchical agglomerative clustering, parallel
computing.
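
As a rough illustration of why peak matching on one-dimensional mass/charge values can reach O(R log R), the sketch below sorts the values and splits them wherever two neighbours differ by more than a tolerance. In one dimension this gives the same groups as single-linkage agglomerative clustering cut at that tolerance. It is not the BHC algorithm described in this paper; the function name `match_peaks`, the parameter `tol`, and the reading of R as the number of peaks are assumptions made for the example.

```python
from typing import List


def match_peaks(mz_values: List[float], tol: float) -> List[List[float]]:
    """Group m/z values into clusters (hypothetical helper, not BHC).

    Two neighbouring values in sorted order share a cluster when their
    difference is at most `tol`.
    """
    if not mz_values:
        return []

    ordered = sorted(mz_values)      # O(R log R) for R values
    clusters = [[ordered[0]]]

    for mz in ordered[1:]:           # single O(R) pass over sorted values
        if mz - clusters[-1][-1] <= tol:
            clusters[-1].append(mz)  # close to the previous value: same peak
        else:
            clusters.append([mz])    # gap larger than tol: start a new peak
    return clusters


if __name__ == "__main__":
    peaks = [100.01, 250.50, 100.02, 250.52, 400.00]
    for cluster in match_peaks(peaks, tol=0.05):
        print(cluster)
```

The sort dominates the cost, so the whole procedure is O(R log R) under these assumptions; a general-purpose HAC implementation that works from a full pairwise distance matrix needs at least quadratic time and memory.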
Nazanin Zounemat Kermani, Xian Yang, and Yike Guo are with the
Department of Computing, Data Science Institute, Imperial College London.
James McKenzie and Zoltan Takats are with the Faculty of Medicine,
Department of Metabolism, Digestion and Reproduction, Imperial College
London (e-mail: n.kermani@imperial.ac.uk).
Cite: Nazanin Zounemat Kermani, Xian Yang, Yike Guo, James McKenzie, and Zoltan Takats, "A Bi-directional Hierarchical Clustering (BHC) for Peak Matching of Large Mass Spectrometry Data Sets," International Journal of Machine Learning and Computing, vol. 11, no. 6, pp. 373-379, 2021.
Copyright © 2021 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).