A Lung Transcriptomic Analysis for Exploring Host Response in COVID-19

Severe Acute Respiratory Syndrome Corona Virus-2 (SARS-CoV-2) rose without precedent for Wuhan, China, in December 2019. It is a kind of exceptionally pathogenic human coronavirus (HCoV) which causes zoonotic sicknesses and represents a significant risk to general wellbeing. Recognizing the hidden biology and pathogenesis of this novel coronavirus is extremely critical to comprehend as well as boosting the treatment of this deadly pandemic. The point of this study is to recognize key genes which show significant expression in the SARS-CoV-2 infected lungs as compared to healthy ones. Our analysis uncovered 149 gene-signatures that show substantial up-regulation in COVID-19 lungs. Out of these, top ten dysregulated genes STAP1, CASP5, FDCSP, CARD17, ST20, AKR1B10, CLC, KCNJ2-AS1, RNASE2 and FLG are found to be significant based on various crucial statistical factors and may end up being acceptable helpful drug targets.


INTRODUCTION
The novel coronavirus -Severe Acute Respiratory Syndrome Corona Virus-2 (SARS-CoV-2) which is the causative agent of recent Coronavirus Disease-2019 (COVID-19) pandemic and has infected around 3.2 million peoples and claimed over 2,24,000 deaths (https://www. who.int/docs/default-source/coronaviruse/ situation-reports/20200501-covid-19-sitrep. pdf?sfvrsn=742f4a18_2 ) 1-3 . It is not just a health crisis but has become the cause of global public, financial and political turmoil. No vaccine and antiviral drug are available right now for its clinical management.
It is therefore imperative to study the host response toward the viral infection at the cellular level in order to enhance our understanding of the disease pathophysiology and be able to identify drug molecules for combating the disease. To meet these ends, a recent brilliant study conducted at Icahn School of Medicine at Mount Sinai, New York, USA generated an RNA-seq datasets comprising in vitro, ex vivo, and in vivo systems of SARS-CoV-2 and other related viruses' infection 4 .
In our study, we focused on the reanalysis of four of these samples to extract some new information beyond the original publication.

MATERIALS AND METHODS
The RNA-Seq dataset GSE147507 containing the gene count matrix of the samples of Healthy lung biopsy (GSM4462413, and GSM4462414) and COVID-19 lung (GSM4462415, and GSM4462416) used in this research was taken from the Gene Expression Omnibus (GEO) database 5 and submitted by Daniel et al. [4] The preprocessing of the data began with the filtering of those genes which are at low-count level. In this study, filtering was done by taking low count filtering method: Mean (filter features where Row Means < 10) and normalization of the data was carried out by trimmed mean of M-values normalization (TMM) normalization method. Since, the count data obtained from a sequencing based experiment may also be affected from the depth of sequencing and the variations in the composition of the features that are detected. Therefore, the TMM normalization method was deployed in our study to estimate relative RNA production levels from RNA-seq data. The checking of the quality and normalization of the data used was carried out by using R /Bioconductor software & packages [6]. EdgeR is a software package for analyzing the differential expression profile of replicated data at the exon, gene, transcript as well as tag level. A characteristic feature of its functionality comprises of empirical Bayes based methods which allows the assessment of biological variation at gene-specific level. In addition, it may also be used for those experiments in which low levels of biological replication exist. It also utilizes TMM normalization for the differential analysis purpose.
Differential expression analysis can be performed by various statistical tests. In this study, we used EdgeR as the differential Expression method using TMM normalization and zero dispersion with exact test. Statistically significant parameters such as p-value, foldchange, adjusted-p value, were also calculated. The Gene Ontology (GO) analysis for the biological process and human phenotype ontology were also performed using Enrichr tool 7 . Further, the protein-protein interaction (PPI) network was also constructed utilizing STRING 8 for all the DEGs obtained after analysis.

RESULTS
The samples obtained from GEO dataset originally contain 21795 genes/regions but filtering of this data is very much required in order to obtain differentially expressed genes as it eliminates the undesirable variation sources. After this preprocessing, the no. of genes/regions came down to 13890. Box plot and Density plot were used to check the quality of the data as shown in Figure 1 (A) and 1(B) respectively.
The identification of Differentially Expressed Genes (DEGs) in COVID-19 affected lung is vital for strategy development in order to detect and treat this pandemic. Normally, for DE analysis, both fold-change and p-value are considered to be the most important factors. Therefore, we took those genes which are having adj. p-value less than 0.01 and log Fold Change greater than 2 and 149 DEGs were screened for further analysis. The scatter plot and volcano plot were also obtained in order to decipher the regulation of the genes as shown in Figure 2(A) and 2(B) respectively 9 . It is clear from the plots that the DEGs obtained only show up-regulation, no down-regulated genes were obtained based on the significant threshold we choose. The top ten DEGs in the decreasing order of their fold change were mentioned in the Table 1.

Journal of Pure and Applied Microbiology
The GO component: Biological process was calculated by taking all 149 DEGs into consideration and the top GO sorted based on p-value ranking has been depicted in Figure 2(C). Similarly, human phenotype ontology was calculated in order to obtain a standardized vocabulary of phenotypic abnormalities encountered in human disease as shown in Figure 2(D).
STRING database [8] was used for the construction of PPI network of all screened DEGs. The least required interaction scoring was kept to 0.400 at the level of medium confidence and active interaction sources such as Experiments, Text mining, Co-expression, Databases, Neighborhood, Co-occurrence, and Gene Fusion were used. The disconnected nodes were removed as they are supposed to be less significant. The obtained network consists of 144 nodes & 770 edges (with an enrichment p-value of PPI less than 1.0e-16). The average node degree and avg. local clustering coefficient was found to be 10.7 and 0.545, respectively. The important functional enrichments in the network has been depicted in the Figure 3.

DISCUSSION
The pathway and GO analysis of the significant signature genes enrich various immunesystem related terms such as chemokine signaling pathway, cytokine-cytokine receptor interaction, NOD-like receptor signaling pathway, cytokine receptor binding, CCR chemokine receptor binding, and cytokine activity. Clearly, viral infection mounts a cytokine storm [10] which may lead to acute respiratory distress syndrome (ARDS) as well as acute lung injury (ALI), often leading to reduction proper functioning of the lungs or even death [4].
Top dysregulated proteins were also found to be involved in similar processes. Protein STAP1 in macrophage colony-stimulating factor receptor binding, CASP5, and ST20 in apoptosis, FDSCP in regulator of antibody responses, CARD17 in expression of pro-inflammatory cytokine IL-1b, CLC for immune suppression, interestingly protein RNASE2 is involved in antiviral activity against respiratory syncytial virus and its agnostics may be used as effective agent against COVID-19 [11]. We foresee our re-analysis would help researchers in designing effective therapeutics by screening some of these genes.
The samples, we have used in our analysis were not from in vitro, or from animal models but from COVID-19-infected patients. Given the paucity of gene expression data available on this disease, we hope our analysis could be further exploited for identification of potential drug targets. One limitation of this study is that -it actually requires a large number of samples for more robust analysis that our study lacks indeed.
In future, we intend to incorporate our outcomes from the re-analysis of RNA-seq data with other samples present in the GEO dataset and try to investigate the variation in the sequence (SNPs) and their impact on this deadly disease.

FUNDING
None.

ETHICS STATEMENT
This article does not contain any studies with human participants or animals performed by any of the authors.