Gene microarray datasets are typically high-dimensional, contain few samples, and are prone to noise. Analyzing these data with supervised and unsupervised learning algorithms is useful for gene characterization, disease diagnosis, and gene therapy. For many years, principal component analysis (PCA) has been used as a tool in algorithms for gene expression classification. Previous solutions use L2-norm PCA; however, L1-norm PCA is more resistant to outliers and offers improved results. The two methods are compared as pre-processing steps for support vector machine (SVM) classification of genetic mutations and co-regulation on several publicly available datasets. On gene microarray data, L1-norm PCA yields higher classification accuracy than L2-norm PCA.