Manual


| Overview and a suggested pipeline


| Input & output formats

The input and output format requirements for each module are listed below; users should prepare their data files accordingly before starting a calculation.

Feature calculation
    Input: *.smi; *.sdf
    Output: *.csv
    Description: The molfile blocks in the SDF must use the V2000 format (the de facto standard); both 2D and 3D information are valid. In the output *.csv file, the first row contains the descriptor names and the first column contains the SMILES of the molecules.
    Example files: *.smi file; *.sdf file

Model selection
    Input: *.csv
    Output: data table in the page
    Description: In the X_train file, the first column may be a molecular identifier such as a molecule name or ID, and the first row must contain the descriptor names. In the y_train file, the first column must be identical to the first column of the X_train file, and the second column must contain the experimental value of each sample (class labels expressed in any other way must be converted to 0 or 1).
    Example files: X train file; y train file

Model building
    Input: *.csv
    Output: data table in the page; *.png
    Description: Same as "Model selection".
    Example files: X train file; y train file

Prediction
    Input: *.csv
    Output: data table in the page; *.csv
    Description: The requirements for the X_test file are the same as for the X_train file.
    Example files: X test file

Validation of molecules
    Input: *.sdf
    Output: *.csv
    Description: The molfile blocks in the SDF must use the V2000 format.
    Example files: *.sdf file

Standardization of molecules
    Input: *.sdf
    Output: *.sdf
    Description: Same as "Validation of molecules".
    Example files: *.sdf file

Custom preprocessing
    Input: *.sdf
    Output: *.sdf
    Description: Same as "Validation of molecules".
    Example files: *.sdf file

Imputation of missing values
    Input: *.csv
    Output: *.csv
    Description: The first row of the input file must be a header (e.g. descriptor names); every column, including the first one, must contain feature values (e.g. descriptor values).
    Example files: *.csv file

Removing low variance features
    Input: *.csv
    Output: *.csv
    Description: Same as "Imputation of missing values".
    Example files: *.csv file

Removing high correlation features
    Input: *.csv
    Output: *.csv
    Description: Same as "Imputation of missing values".
    Example files: *.csv file

Univariate feature selection
    Input: *.csv
    Output: *.csv
    Description: The first row of the input file must be a header (e.g. descriptor names); every column from the first to the penultimate one must contain feature values (e.g. descriptor values); the last column must contain the experimental value of each sample (class labels expressed in any other way must be converted to 0 or 1).
    Example files: *.csv file

Tree-based feature selection
    Input: *.csv
    Output: data table in the page; *.csv
    Description: Same as "Univariate feature selection".
    Example files: *.csv file

RFE feature selection
    Input: *.csv
    Output: data table in the page; *.csv; *.png
    Description: Same as "Univariate feature selection".
    Example files: *.csv file

Statistical analysis
    Input: *.csv
    Output: data table in the page; *.png
    Description: The input file must contain four columns in this order: molecular identifier, predicted label, predicted probability and experimental value; the column names can be defined by the user.
    Example files: *.csv file

Random training set split
    Input: *.csv
    Output: *.csv
    Description: Same as "Model selection".
    Example files: X data file; y data file

Diverse training set split
    Input: *.sdf
    Output: *.sdf
    Description: The molfile blocks in the SDF must use the V2000 format; both 2D and 3D information are valid.
    Example files: *.sdf file

Feature importance
    Input: *.csv
    Output: *.csv
    Description: Same as "Univariate feature selection".
    Example files: *.csv file
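
For the X_train / y_train layout required by "Model selection", "Model building" and "Random training set split", the following minimal pandas sketch writes a pair of example files. The molecule IDs, descriptor names and activity values are made up purely to illustrate the layout; this is not a ChemSAR utility.

```python
import pandas as pd

# Made-up molecules, descriptors and 0/1 labels -- layout illustration only.
X_train = pd.DataFrame(
    {"MW": [180.2, 206.3], "LogP": [1.2, 3.4], "TPSA": [63.6, 37.3]},
    index=["mol_1", "mol_2"],          # first column: molecule name or ID
)
y_train = pd.DataFrame(
    {"activity": [1, 0]},              # second column: experimental class (0/1)
    index=["mol_1", "mol_2"],          # first column: must match X_train
)

X_train.to_csv("X_train.csv", index_label="Name")
y_train.to_csv("y_train.csv", index_label="Name")

print(open("X_train.csv").read())
print(open("y_train.csv").read())
```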

| Algorithms and parameters

ChemSAR now provides five main algorithms for building classification models. The algorithms, along with the main parameters that need to be optimized and the recommended settings, are listed below.

RandomForest
    n_estimators: the number of trees in the forest. Recommended: 500.
    max_features: the number of features to consider when looking for the best split (start_feature, end_feature and step define the max_features values to try). Recommended: sqrt(N), where N is the number of features.
    cv: cross-validation fold. Recommended: 5.

SVM
    kernel type: the kernel to be used; one of rbf, sigmoid, poly or linear.
    C: penalty parameter of the error term. Recommended: 2^-5, 2^15, 2^2 (format: start, end, step).
    gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. Recommended: 2^-15, 1, 2^2.
    degree: degree of the polynomial kernel function. Recommended: 1, 7, 2.
    cv: cross-validation fold. Recommended: 5.

Naïve Bayes
    Bayes classifier type. Recommended: BernoulliNB for binary-valued variables; GaussianNB for continuous variables.
    cv: cross-validation fold. Recommended: 5.

K Neighbors
    n_neighbors: the number of neighbors to use. Recommended: 1-10.
    algorithm: the algorithm used to compute the nearest neighbors ('ball_tree', 'kd_tree', 'brute'). Recommended: automatic decision.
    cv: cross-validation fold. Recommended: 5.

DecisionTree
    The tree parameters are selected automatically by the built-in functions (see the theory section below).
    cv: cross-validation fold. Recommended: 5.
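
The "start, end, step" ranges describe a grid of candidate values: for C and gamma the step is multiplicative (each candidate is the previous one multiplied by the step), while for degree it is additive. As a rough sketch of how such a specification could expand (the exact expansion used by ChemSAR, and whether the end point is included, are assumptions here), the recommended ranges above correspond to something like:

```python
import numpy as np

# C: 2^-5, 2^15, 2^2  ->  exponents -5, -3, ..., 15 (multiplicative step of 4)
C_values = 2.0 ** np.arange(-5, 15 + 1, 2)

# gamma: 2^-15, 1, 2^2  ->  exponents -15, -13, ..., -1 (upper bound 1 = 2^0)
gamma_values = 2.0 ** np.arange(-15, 0 + 1, 2)

# degree: 1, 7, 2  ->  1, 3, 5, 7 (additive step)
degree_values = list(range(1, 7 + 1, 2))

print(C_values)
print(gamma_values)
print(degree_values)
```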

| Theory and how to choose a proper algorithm

In ChemSAR, we have implemented five learning algorithms so far. The following is a brief description of each algorithm; users can decide which one to choose according to their purposes and datasets.

Random Forest [1]

Random forest (RF) was proposed by Breiman in 2001 and uses an ensemble of classification trees. Each tree in the ensemble is constructed using a different bootstrap sample of the data. In addition, when constructing a tree, each node is split using the best feature among a randomly chosen subset of predictors at that node instead of the best split among all variables. Due to this randomness, the bias of the forest usually increases slightly, while the variance decreases thanks to the averaging strategy. These strategies give the RF algorithm excellent performance in classification tasks. In building SAR classification models, it is suitable for datasets with many more features than observations and for datasets containing some noisy features. During RF model selection, two main parameters (mtry and ntree) need to be optimized to achieve excellent performance. mtry, the number of input variables tried at each split, is particularly important; ntree is the number of trees grown in the forest. To find a good set of parameters, a grid search strategy is used by entering n_estimators, start_feature, end_feature and step in the corresponding submission page: n_estimators corresponds to ntree, and start_feature, end_feature and step define the mtry values to try. In addition, a k-fold cross-validation process is used to assess model performance, so the parameter cv (cross-validation fold) is required for this calculation.
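
As an illustration of this grid search, the sketch below uses scikit-learn's GridSearchCV on a synthetic dataset; the dataset, the max_features range (start_feature = 2, end_feature = 10, step = 2) and the accuracy scoring are assumptions made for the example, not ChemSAR's internal implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for an X_train / y_train pair prepared as in the I/O section.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# n_estimators corresponds to ntree; the max_features list corresponds to the
# attempts of mtry generated from start_feature, end_feature and step.
param_grid = {
    "n_estimators": [500],
    "max_features": list(range(2, 11, 2)),  # e.g. start=2, end=10, step=2
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```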

Support Vector Machine [2]

Support Vector Machine (SVM) was proposed by Vapnik and his coworkers. The optimal separating hyperplane strategy gives SVM many nice statistical properties, which have greatly accelerated its adoption and application in a variety of fields. To obtain a good SVM classification model, a set of parameters must be selected depending on the chosen kernel. The kernel type specifies the kernel used by the SVM algorithm and must be one of linear, poly, rbf or sigmoid. If the rbf kernel is chosen, the ranges of C (the penalty parameter of the error term) and gamma (the kernel coefficient) must be set; if the linear kernel is chosen, only the range of C is required; if the poly kernel is chosen, the ranges of C, gamma and degree (the degree of the polynomial kernel function) are required; if the sigmoid kernel is chosen, the ranges of C and gamma must be set. A grid search strategy is then employed to identify the optimal combination of the corresponding parameters. Besides, the parameter cv (cross-validation fold) is also required for this calculation.
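
The corresponding grid search for an rbf-kernel SVM could look like the following scikit-learn sketch; the synthetic dataset and the exact grids are placeholders, and ChemSAR performs the equivalent search internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Exponential grids for C and gamma, loosely following the recommended ranges.
param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),
    "gamma": 2.0 ** np.arange(-15, 1, 2),
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```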

Naive Bayes [3]

Naive Bayes (NB) methods have been studied extensively since the 1950s; they are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. In spite of its apparently over-simplified assumptions, NB works quite well in many real-world situations, including drug discovery studies. To some extent, NB can work better than more complex methods in the presence of noisy data. It requires only a small amount of training data to estimate the necessary parameters and is not sensitive to missing data. In addition, the NB algorithm can be extremely fast compared with more sophisticated methods, which makes it computationally efficient and suitable for handling large datasets with acceptable model performance. In this module, two types of distributions are available for dealing with different types of data. GaussianNB implements the NB algorithm with a Gaussian distribution; it is suitable for continuous data such as SAR data with descriptor features. BernoulliNB implements the algorithm for data distributed according to multivariate Bernoulli distributions; to use this distribution, the samples must be represented as binary-valued feature vectors, such as SAR data with fingerprint features. After choosing a proper distribution type and setting the cv value, the model selection process is started and returns the best choice.
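
The choice between the two distributions can be explored locally with a sketch like the one below; the synthetic data and the thresholding used to fake binary, fingerprint-style features are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Continuous, descriptor-style data -> GaussianNB.
X_desc, y = make_classification(n_samples=200, n_features=20, random_state=0)
print("GaussianNB :", cross_val_score(GaussianNB(), X_desc, y, cv=5).mean())

# Binary, fingerprint-style data -> BernoulliNB (here simply thresholded features).
X_fp = (X_desc > 0).astype(int)
print("BernoulliNB:", cross_val_score(BernoulliNB(), X_fp, y, cv=5).mean())
```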

K Nearest Neighbors [4]

K Nearest Neighbors (k-NN) is a simple, instance-based, supervised method that classifies an unknown sample by a simple majority vote of its k nearest neighbors, where k is a constant specified by the user. To determine the nearest neighbors, the Euclidean distance matrix of all samples in the training set is scanned for the k shortest distances to the observation of interest. To use the k-NN method, the parameter k should be optimized, because the best choice of k depends largely on the data. In general, a larger k value reduces the effect of noise on the classification but makes the boundaries between classes less distinct. In this module, the best k value can be selected by entering the range and step size to search in the corresponding submission page. Two additional points should be noted. First, feature scaling is performed before building a k-NN model: the mean of each feature is removed, and non-constant features are scaled by dividing them by their standard deviation. This is done because the accuracy of the k-NN algorithm can be sensitive to the distance measure. Second, when dealing with binary classification problems, it is better to set k to an odd number, as this avoids tied votes.
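
A minimal scikit-learn sketch of this procedure (scaling followed by a search over odd k values) is shown below; the synthetic data and the k range of 1-9 are illustrative assumptions, not ChemSAR defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Scale features first (k-NN is sensitive to the distance measure),
# then search odd k values to avoid tied votes in binary classification.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```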

Decision Tree [5]

Decision Tree (DT) can be used to build a predictive classification model by learning simple decision rules inferred from the data features. The tree is generated in a recursive binary way, resulting in nodes connected by branches. A node that is partitioned into two new nodes is called a parent node; the new nodes are called child nodes; a terminal node is a node with no child nodes. A DT procedure generally consists of three steps. In the first step, the full tree is built using a binary split procedure; the full tree is an overgrown model that closely describes the training set and generally overfits. In the second step, the overfitted model is pruned, generating a series of less-complex trees derived from the full tree. In the third step, the optimal tree is chosen using a cross-validation procedure. Among the commonly used learning methods, DT has several advantages: a) it is simple to understand and interpret; b) it requires little data preparation; c) it yields a white-box model, so the observed outcome can easily be explained by Boolean logic; d) it performs well on large datasets. These advantages make DT a popular choice for certain datasets. In this module, there is no algorithm-specific parameter for the user to tune (apart from the cross-validation fold cv), because the built-in functions search for the best parameters to build the tree.
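
Since ChemSAR selects the tree parameters automatically, nothing below is required to use the module; the sketch only illustrates the grow-prune-select procedure described above, using scikit-learn's cost-complexity pruning on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Step 1 and 2: grow the full tree and derive a series of pruned, less-complex
# trees via the cost-complexity pruning path (candidate ccp_alpha values).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
param_grid = {"ccp_alpha": [max(a, 0.0) for a in path.ccp_alphas]}

# Step 3: cross-validation picks the pruned tree that generalises best.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```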