Abstract
Myelodysplastic syndromes (MDS) are a group of genetic diseases that affect stem cells in the bone marrow. They affect around 10,000 people each year in the United States alone [1]. Comparing the genes of MDS patients with those of healthy individuals could provide insight into how to diagnose and treat the disease. Biologists have recently begun studying the roles of both protein-coding and non-coding genes in MDS through gene expression data. In this paper, two gene expression datasets are obtained and processed to identify the most statistically significant genes that contribute to the best clustering results. Those genes are further analysed to study their differential expression levels. The paper concludes that the majority of the selected features are differentially expressed and yield a high Adjusted Rand Index (ARI) score (exceeding 0.8 for both datasets) when used to cluster the MDS and control groups. We propose that biologists study these genes further to identify their potential relation to MDS and their possible use in MDS diagnosis.