In silico Approach to Elucidate Factors Associated with GH1 β-Glucosidase Thermostability

β-Glucosidase is a class of hydrolytic enzymes that catalyzes the removal of the non-reducing β-Dglucosyl unit from various disaccharides and substituted β-D-glucosides. β-Glucosidase belongs to Glycoside Hydrolase (GH) families 1 and 3 and potentially has many biotechnological applications with thermostable enzymes are preferred over mesophilic homologs in different applications. In the present work, a comparative analysis of physicochemical properties and amino acids composition of 60 (20 mesophilic, 20 thermophilic and 20 hyperthermophilic) β-glucosidases were performed. Multiple sequence alignment and phylogenetic tree analysis were constructed. Analysis of Variance (ANOVA) showed that several physicochemical properties including molecular weight, isoelectric point, number of positively charged amino acids, and extinction coefficient are statistically different among β-glucosidases groups (P<0.05). The analysis also showed that content of amino acids Asp, Gln, Cys, His, and Thr is significantly higher in mesophilic enzymes whereas that of Glu, Lys, Tyr, and Trp is higher in thermoand hyperthermostable homologs (P<0.05). Overall, nonpolar amino acids were the most abundant amino acids group in β-glucosidase with no significant difference among meso-, thermo-, and hyperthermophilic enzymes. Conversely, the content of polar amino acids is statistically higher (P<0.05) in mesophilic enzymes whereas that of charged and aromatic amino acids is significantly higher (P<0.05) in thermoand hyperthermophilic counterparts. Finally, multiple regression analysis showed that both polar and aromatic amino acids contribute significantly (P<0.05) to the thermostability. Optimal temperature variation of 53% could be explained by these two groups of amino acids. In conclusion, several amino acids appear to contribute to the thermostability of β-glucosidases and the findings from this study should pave the road toward a better understanding of thermostability of β-glucosidases and protein engineering.


Sequence Alignment and Phylogenetic Tree Construction
The retrieved sequence of M-BGLs, T-BGLs, and HT-BGLs were aligned using muscle tool for multiple sequence alignment (MSA) at EMBL-EBI (https://www.ebi.ac.uk/Tools/msa/ muscle/) 30 . The alignment was retrieved in FASTA format and edited by BoxShade server (https:// embnet.vital-it.ch/software/BOX_form.html) 31 . Further, MSA was submitted to Phylogeny online tool (http://www.phylogeny.fr/) to construct a phylogenetic tree of selected BGLs sequences 32 using the defualt setting (maximum likelihood method, WAG substitution model, bootstrap 16).

Statistical Analysis and Significance Inference
Graphpad Prism 5 was used for calculating statistical parameters of physicochemical properties and amino acids compositions for M-BGLs, T-BGLs, and HT-BGLs. First, analysis of variance (ANOVA) was carried out to find whether there is a significant difference in the means of the parameters of M-BGLs, T-BGLs, and HT-BGLs. The null hypothesis states that there is no significant difference in the means of physicochemical properties and amino acid composition between M-BGLs, T-BGLs, and HT-BGLs. The confidence interval for significance was 95% and P-value <0.05 was considered significant. Next, where ANOVA detected a significant difference, post hoc Tukey's test was used for multiple comparisons of the means of two groups. Finally, attributes showed a significant difference between M-BGLs, T-BGLs, and HT-BGLs and correlated with the optimal temperature of BGLs activity were used for multiple linear regression analysis to construct a model for optimal temperature prediction from amino acids compositions.

Multiple Sequence Alignment and Phylogenetic Tree Construction
Total sixty GH1 BGL sequences for which experimental optimal temperature has been determined (20 M-BGLs, 20 T-BGLs, and 20 HT-  [50] a Amino acids at position 1-18 were predicted as signal sequence and removed.

Journal of Pure and Applied Microbiology
BGLs; Tables 1-3, respectively) were retrieved from the UniProt database. Multiple Sequence Alignment (MSA) analysis revealed that several amino acids motifs are conserved among all GH1 BGLs. β-Glucosidase is a single polypeptide protein that folds to form a GH1 classical (β/α)8 TIM barrel structure comprised of eight α-helices and eight β-strands linked by short loops. GH1 BGL utilizes two key glutamic acid residues as a general acid/base catalyst and nucleophile 87 . MSA   [66] showed the conservation of both Glu residues in all BGLs regardless of optimal temperature. The first Glu residue is the general acid/base conserved at position 166 (for BGL from Humicola insolens (HiBGL) as reference, see supplementary data S. Fig. 1) at conserved motif TXNEP (Thr-X-Asn-Glu-Pro) and the second Glu residue is the nucleophile conserved at position 377 in consensus sequence TENG (Thr-Glu-Asn-Gly) 88 . The active site is located at C-terminal of the barrel and is made up of two subsites namely glycon binding site (subsite -1) and aglycon binding site (subsite +1). The catalytic acid/base is located at the C-terminal of β-strand 4 and the nucleophile at the C-terminal of the β-strand 7 88 . In HiBGL, glycon binding site (subsite -1) lies at the bottom of the barrel with Gln17, His120, Trp121, Asn165, Tyr308, Trp427, Glu434, Trp435 and Phe443 residues 65 . MSA showed that these residues are conserved throughout BGL evolutionary history and the side chains of which interact with glycon moiety through both hydrogen and hydrophobic bonds. Conversely, aglycon binding site (subsite +1) is less conserved and is determined by Thr177, Tyr179, Phe325, Leu326, Thr331, Phe333 and Phe348 (in HiBGL numbering) which function as gatekeepers and explain the aglycone broad substrate specificity exhibited by this enzyme 89 . Moreover, Trp 168 and Leu173 were found to be responsible for glucose tolerance 90 and MSA showed that these two residues are conserved among high glucose-tolerant BGLs.
The aglycon appeared to anchor by hydrophobic contacts and water-mediated polar bonds 91 .
In contrarily, there are few studies on amino acids/motifs associated with thermostability of BGLs. Tamaki 92 .
MSA demonstrated that these residues are conserved and the majority of which are proline or positively charged amino acids. Additionally, these residues appeared to be distributed in the loop segments of BGL whereas, amino acids related to enzyme activity are mainly concentrated around α-helices and β-strands 92 . Altogether these residues represent a hotspot for BGL engineering in future. However, further studies to identify more amino acids variants and motifs related to the thermostability in M-, T-, and HT-BGL may be required.
Multiple sequence alignment was used to construct a phylogenetic tree to visualize the evolutionary relationship between M-, T-, and HT-BGLs. BGLs were clustered into three major clades (Fig. 1). Clade I was dominated by T-BGLs (9 sequences, 52.9%) followed by M-BGLs (5, 29.4%) and HT-BGLs (3, 17.7%). Clade II was dominated by M-BGLs (10, 43.5%) followed by T-BGLs (8, 34.8%) and HT-BGLs (5, 21.7%). Both mesophilic and thermophilic fungal BGLs analyzed were clustered together in this clade suggesting their bacterial origin. Clade III was dominated by HT-BGLs from both bacteria and archaea (12, 60%) followed by M-BGLs (5, 25%) and thermostable BGLs (3, 15%). Clustering of mesophilic, thermophilic and hyperthermophilic BGLs together indicates the existence of structural and functional similarities. The clustering of mesophilic and thermophilic protein is in agreement with previous reports 93,94 . Similarly, HT-BGLs from both archaea and bacteria were also clustered together in clade III indicating the structural similarity among them. Archaeal and bacterial proteins have been clustered together in several phylogenetic tree analyses 95,96 . This is because many proteins distinguishing these two domains belong to information processing proteins such as DNA replicating enzymes, and transcription and translation associated protein 97 .

Comparative Analysis of Physicochemical Properties
All GH1 BGLs appear to lack of signal peptide and to localize in the cytoplasm except M-BGL-18 which was predicted to have 18 residues and to localize in the periplasm. GH 1 BGLs are known to be localized in the cytoplasm 3 . Statistical analysis (ANOVA and followed by Tukey test) demonstrated that MW of HT-BGLs is significantly higher than M-BGLs or T-BGL (P<0.05, Table 4). Increase in the MW of HT-BGLs could be attributed to the higher content of larger amino acids such as Lys, Tyr and Trp and lower content of smaller amino acids such as Gly, Gln, and Cys in HT-BGLs 98,99 . Similarly, PI is significantly higher in HT-BGLs than M-BGLs and T-BGLs (P<0.05, Table  4). A similar finding was reported for thermostable nitrilase over their mesophilic counterparts 95 . PI indicates the pH at which the protein has an  Additionally, the analysis showed that numbers of positively charged amino acids (Lys and Arg) are higher in HT-BGLs than M-BGLs and T-BGLs (P< 0.05, Table 4). Increased content of positively charged amino acids in thermostable BGLs can be postulated to involve in salt bridge formations and thus enhancing protein thermostability [102][103][104][105] . Indeed, there is experimental evidence showing that the redesigning of salt bridge significantly enhanced BGL thermostability 106 . Finally, the extinction coefficient is also statistically higher in HT-BGLs than M-BGLs and T-BGLs (P<0.05, Table 4). Extinction coefficient reflects aromatic amino acids content (Phe, Tyr, and Trp) which in turn appears to enhance protein thermostability through increasing protein hydrophobicity and packing 106 .
Conversely, number of negatively charged amino acids, AI, II, and GRAVY did not show any statistical difference in their means among M-, T-, and HT-BGLs. Similar findings have been reported for nitrilase/cyanide hydratase family from mesophilic, thermophilic, and hyperthermophilic bacteria 95 and serine protease from mesophilic and thermophilic microorganisms 107 . AI indicates the relative volume occupied by the side chain of hydrophobic amino acids (Ala, Val, Leu, and Ile) and may suggest thermostability of protein. AI was higher for all BGLs analyzed in the present study suggesting their overall stability 108 .

Comparative Analysis of Amino Acids Composition
ANOVA analysis demonstrated that Asp, Cys, Gln, His and Thr are significantly higher in M-BGLs than HT-BGLs and T-BGL homologs (P< 0.05, Table 5). These amino acids are unstable at higher temperature and undergo either oxidations Journal of Pure and Applied Microbiology or deamination at higher temperature explaining why they are less common in thermostable protein compared to mesophilic homologs 22,95,104,[109][110][111] . Cys specifically plays a dual role by, on one hand, reducing thermostability through increasing internal cavities and oxidation at a higher temperature and, on the other, increasing thermostability through the formation of disulfide bonds which enhance protein rigidity and stability 112 . Conversely, Glu, Lys, Trp, and Tyr are significantly higher in HT-BGLs than their T-BGLs and M-BGLs counterparts (P<0.05, Table 5). Glu is negatively charged amino acids common in both exposed and buried region of the protein and involved in electrostatic interactions. Farias et al. (2003) found E+K increased and Q+H decreased in thermostable protein suggesting E+K/Q+H ratio can be used as an indicator of thermal stability 113 .
Similarly, Lys is positively charged amino acid which involves in ionic interactions resulting in enhanced thermo-stability and hence it is more abundant in thermophilic and hyperthermophilic proteins [114][115][116] . Furthermore, both Trp and Tyr are aromatic amino acids which are more common in thermostable protein than their mesophilic homologs 13,117 . Aromatic amino acids contribute to protein thermostability through π-π and cation-π interactions 12,118 . Gly was significantly higher in T-BGLs than HT-BGLs or M-BGLs homologs (P<0.05, Table 5). Gly is small hydrophobic amino acid responsible for creating void or cavity in the interior of protein thus hyperthermostable protein are evolved to have less Gly content to minimize the cavities which may disturb protein upon temperature increase 104,117 . The analysis also showed that there is no significant difference in the means of nonpolar amino acids Ala, Ile, Leu, Met, Phe, Pro, Val, and polar amino acid Met, Arg, Asn, and Ser between M-BGL, T-BGL and HT-BGL homologs (P>0.05, Table 5). Ala is the best helix forming residue associated with increased thermostability and packing of the protein 119,120 . Ile was found to be more common in thermostable compared to mesophilic protein 100 . Phe is a hydrophobic amino acid that tends to bury inside protein thus was higher in hyperthermophilic protein than their meso-and thermophilic homologues 121 . Previous research reported that α-helices of thermophilic protein are more stable than those of mesophilic homologs perhaps due to the high abundance of amino acids with greater propensity to form α-helices (Ala, Leu, Arg) and low abundance in β-branch sheet forming residues (Val, Ile, Thr). α-helices of thermostable protein can also be stabilized by interactions between side chains of amino acids such as Glu and Arg 119,122,123 . Pro has pyrrolidine ring which allows it to have least conformational states and low conformational entropy restricting the configuration of preceding amino acids thus it is more common on rigid and turn conformations and hence reported to be higher in thermophilic protein 116 . Pro has been used to increase protein thermo-stability and can be considered, here, a potential hotspot to enhance thermostability of BGLs 124 . Similarly, Met, Asn, and Ser are thermo-labile that undergo either oxidation or deamination (Asn) at elevated temperature and are therefore less common in the thermostable protein 125,126 . Indeed, the substitution of Ser by Ala in thermophilic protein is widely reported 100 . Arg is a positively charged residue that participates in electrostatic bond formation to enhance protein stability 127,128 . The present study cannot justify why the residues such as Ala, Phe, Arg, and Pro which generally contribute to thermostability are not statistically higher in thermostable BGLs than mesophilic one. However, it is important to note that this study compared protein sequences solely from one family (GH1 BGLs) whereas previous studies compared protein sequences from several families; it is well-reported that different protein families adopt different strategies to enhance their thermostability 12 . Collectively, nonpolar amino acids (Ala, Gly, Ile, Leu, Met, Pro, Val) were the most abundant amino acids in all BGLs accounting for about 42.5% of total amino acids with no statistical difference in their means between M-, T-, and HT-BGLs (P>0.05, Table.5). Nonpolar amino acids are buried in the interior of protein and influence its hydrophobicity which is the major interacting force responsible for the stability of protein core 104,117 . Chakravarty et al. (2002) reported that nonpolar amino acids are relatively higher in thermophilic protein than their mesophilic protein 114 . Conversely, polar amino acids (Asn, Gln, Ser, Thr, His, Cys) are significantly higher in M-BGLs than T-BGLs and HT-BGLs (P<0.05, Table 5). Decrease of polar amino acids in thermostable enzymes contributes to thermostability by minimizing cavities, Gln-and Asn-induced deamidation, and Cys, Ser and Thr oxidation at higher temperatures. This finding is in agreement with previous reports 13,125,126,129 . In contrary, charged amino acids (Glu, Asp, Lys, Arg) are higher in HT-BGLs and T-BGLs than M-BGLs (P< 0.05, Table 5). Increase of charged amino acids in the thermostable protein was previously reported and appears to mediate protein thermostability through the formation of hydrogen and ionic interactions 115,126,130 . Finally, aromatic amino acids (Phe, Tyr, Trp) are also significantly higher in HT-BGL than M-BGL and T-BGL analogs (P<0.05). This increase in aromatic amino acids enhances thermostability by increasing hydrophobicity of protein through cation-π and π-π interaction 131 and compactness/packing of protein and decreasing cavities 106 .

Multiple Regression Analysis
As previously demonstrated, the mean numbers of positively, polar, and aromatic amino acids are significantly different between M-, T-, and HT-BGLs with both positively and aromatic amino acids are directly correlated with optimal temperature (r= 0.62 and r= 0.65, respectively) and polar amino acids are negatively correlated (r= -0.55). These variables were used to perform multiple linear regression to determine a model for predicting optimal temperature. However, the positively charged amino acids were excluded from the model because it failed to be a significant predictor as indicated by individual test (P> 0.05). The model was constructed with polar and aromatic amino acids which significantly predicted optimal temperature with R square value of 0.53 indicating that variance in optimal temperature of 53% could be explained by the variation of these two groups of amino acids. This model also has multiple correlation coefficients R of 0.741 indicating that a high-quality prediction of this model. Additionally, βcoefficient indicates that aromatic amino acids (Trp+Tyr) contributed more to predicting optimal temperature than polar amino acids ( Table 6). Of note, the low prediction value of this model (53%) is because thermostability cannot be solely predicted from the primary sequences of protein 132 .

CONCLUSION
Thermostable BGLs differ from their mesophilic counterparts in several physicochemical properties such as molecular weight, isoelectric points, positively charged amino acids, and extinction coefficient. The high abundance of nonpolar amino acids in all BGLs may indicate general stability of BGLs. Additionally, increase in aromatic amino acids (Tyr and Trp) and decrease in polar amino acids (Gln, His, Thr, Cys) contributes significantly to BGL thermostability probably by combined mechanisms of increased hydrophobicity and decreases cavities of globular proteins. Charged amino acids (Lys and Glu) may also contribute to BGL thermostability through the formation of ionic bonds. Overall, these amino acids may be targeted through protein engineering for the conversion of mesophilic BGLs to their thermostable analogs. However, thermostability cannot be predicted solely from amino acids composition since the spatial arrangement of amino acids and structural feature of protein influence protein thermostability. Therefore, future analysis should focus on characterizing amino acids motifs and secondary structure of mesophilic and thermophilic BGLs to elucidate more attributes associated with thermostability. Furthermore, benefiting from a large number of X-ray crystallographic structures of BGLs elucidated to date, a comparative analysis of 3D structures may provide a deep insight into the difference between mesophilic and thermophilic BGLs thus paving the road toward successful protein engineering of this industrially valuable enzyme.