Help
ChemSAR is a web-based platform for the rapid generation of quantitative
Structure-Property and Structure-Activity classification models
for small molecules. Starting from the given structures of small molecules and
following a step-by-step job submission process, ChemSAR generates reliable and
robust predictive models.
To show how ChemSAR generates models online, an example
is given on this page with detailed instructions and
screenshots. In this example, a dataset for building a classification model of toxic
and non-toxic chemicals was taken from (Cao D S, et al., 2015).
The file data.csv contains the SMILES of each sample, and y.csv contains the SMILES
and the true label for each sample.
The following shows in detail how to establish a user-customized model. All the
related files are listed in the
"Files and Download" section, and all the related figures are listed in the "Pictures"
section.
In this stage, structure preprocessing is applied to the molecules in File 1.
Users can select from six kinds of procedures
according to their initial molecular data: 'Adding hydrogen atoms',
'Removing salts', 'Removing hydrogen atoms', 'Convert to SMILES format',
'Compute 2D coordinates' and 'Compute 3D coordinates'. Here, we choose 'Adding
hydrogen atoms',
'Removing salts' and 'Compute 2D coordinates' (see Figure 1).
Input: File 1
Output: File 2
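For readers who want to reproduce this step locally, a minimal RDKit sketch is given below. It is illustrative only: ChemSAR performs the preprocessing server-side, and the file name data.csv and the "SMILES" column name are assumptions taken from this example.

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.SaltRemover import SaltRemover

df = pd.read_csv("data.csv")                 # File 1, assumed to hold a "SMILES" column
remover = SaltRemover()
processed = []
for smi in df["SMILES"]:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:                          # skip structures RDKit cannot parse
        continue
    mol = remover.StripMol(mol)              # 'Removing salts'
    mol = Chem.AddHs(mol)                    # 'Adding hydrogen atoms'
    AllChem.Compute2DCoords(mol)             # 'Compute 2D coordinates'
    processed.append(Chem.MolToSmiles(mol))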
We need to start a job before running the whole modelling process. Click the "Start new job" button to get a unique job ID (see Figure 2).
Here, ChemSAR enables users to compute 783 molecular descriptors and 10 kinds of
commonly used fingerprints. File 2, the output file of stage 1,
is uploaded and 194 features are calculated. These features include 30 constitution
descriptors, 35 topology descriptors,
7 kappa descriptors, 32 Autocorrelation-Broto descriptors, 5 molecular properties,
25 charge descriptors and 60 MOE-type descriptors (see Figure 3).
Input: File 2
Output: File 3
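ChemSAR computes this descriptor set on the server. The RDKit sketch below only illustrates what a molecule-to-feature-row mapping looks like; the four descriptors chosen are arbitrary and are not ChemSAR's own 194-descriptor set.

from rdkit import Chem
from rdkit.Chem import Descriptors

def describe(smiles):
    # Illustrative 2D descriptors only, not ChemSAR's constitution/topology/kappa/... set
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MolWt": Descriptors.MolWt(mol),     # molecular weight
        "TPSA": Descriptors.TPSA(mol),       # topological polar surface area
        "Kappa1": Descriptors.Kappa1(mol),   # a kappa shape descriptor
        "Chi0": Descriptors.Chi0(mol),       # a connectivity (topology) descriptor
    }

print(describe("CCO"))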
After the feature calculation stage, the data should be split into a training set and
a test set.
A random split into training and test sets can be quickly computed with the train
test split module.
There are two main parameters for this calculation: "test size for the data" and
"the random state".
Their exact definitions are listed below the input form. Here,
we set the two parameters to 0.3 and 0 (see Figure 4).
Input: File 4
Output: File 5, File 6, File 7, File 8
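This split corresponds to scikit-learn's train_test_split with test_size=0.3 and random_state=0. A minimal sketch follows; the file names and the label column name are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.read_csv("features.csv")              # hypothetical name for the feature table
y = pd.read_csv("y.csv")["label"]            # the "label" column name is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)     # test size 0.3, random state 0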
This module can be used to impute the missing values in your data. Check your File 5
and File 7. If some values are empty (nan) or gibberish, you can take this step and
impute the data. There are three parameters required: "the missing values type",
"the strategy" and "the axis". The exact definitions of these parameters are also
listed below the input form. Here, we find that the value at column 39, index 35 is
"-inf". This value is either incorrect or cannot be recognized by built-in functions.
To give a clearer picture of how to impute the data, we have manually removed
several values at random, including the value mentioned above (File 7 to File 9).
Then, we use the default values for the parameters
(see Figure 5). Likewise, if other related data contain missing values, you
should also take this step to ensure the data are up to standard.
Input: File 9
Output: File 10
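With the default settings, this behaves like column-wise mean imputation. A minimal scikit-learn sketch, assuming a hypothetical input file name and treating values such as "-inf" as missing:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.read_csv("train_x.csv")                        # hypothetical name for the training features
X = X.replace([np.inf, -np.inf], np.nan)              # treat "-inf"-like values as missing
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)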
This is a simple baseline approach to feature selection: it removes all features
whose variance does not meet a given threshold. By default,
it removes all zero-variance features;
features with a variance lower than the threshold will be removed.
Here, we set the threshold to 0.05 (see Figure 6).
Input: File 11 (File 5 with the SMILES column deleted)
Output: File 12
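This filter corresponds to scikit-learn's VarianceThreshold. A minimal sketch with the threshold of 0.05 used here (the input file name is hypothetical):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.read_csv("file11.csv")                         # hypothetical name for File 11
selector = VarianceThreshold(threshold=0.05)
mask = selector.fit(X).get_support()                  # True for the columns kept by the filter
X_kept = X.loc[:, mask]                               # keep only the surviving columns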
Removing highly correlated features is another useful baseline approach to feature
selection.
It removes features whose pairwise correlation exceeds a given threshold. Here, we set
the threshold to 0.95 (see Figure 7).
Input: File 12
Output: File 13
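ChemSAR performs this filtering on the server; a rough pandas sketch of the idea is shown below, dropping one feature from every pair whose absolute Pearson correlation exceeds 0.95. The exact rule ChemSAR applies may differ, and the file name is hypothetical.

import numpy as np
import pandas as pd

X = pd.read_csv("file12.csv")                                        # hypothetical name for File 12
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))    # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)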
Univariate feature selection works by selecting the best features based on
univariate statistical tests.
It can be seen as a preprocessing step before an estimator. It removes all but the k
highest-scoring features.
Here, we set the k value to "10" and the score function to "chi2". In order to use
the "chi2" function, we have
manually removed the features that contain negative values (this is just for the example).
At the same time, we added the y_true column (from File 6) to the end of the file.
These two steps turn File 13 into File 14
(see Figure 8).
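This step corresponds to scikit-learn's SelectKBest with the chi2 score function, which is why negative-valued features had to be removed first. A minimal sketch, with hypothetical file and column names:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

data = pd.read_csv("file14.csv")                      # hypothetical name for File 14
X = data.drop(columns=["y_true"])
y = data["y_true"]
selector = SelectKBest(score_func=chi2, k=10)         # keep the 10 highest-scoring features
X_selected = X.loc[:, selector.fit(X, y).get_support()]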
Feature importances can be computed by tree-based estimators, which in turn can be
used to discard irrelevant features. This module employs a meta-estimator that fits
a number of randomized decision trees on various sub-samples of the dataset and uses
averaging to improve the predictive accuracy and control over-fitting. There are
three parameters required for the calculation.
Their exact definitions are listed below the input form. Here, we use the default
parameters (see Figure 9 and Figure 10).
Input: File 15 (add the y_true column to File 14 to get File 15)
Output: File 16
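A minimal sketch of the idea behind this module: fit an ensemble of randomized trees, average their feature importances, and keep the informative features. The estimator, its settings, the keep-above-mean rule and the file/column names below are illustrative assumptions, not ChemSAR's exact defaults.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("file15.csv")                      # hypothetical name for File 15
X = data.drop(columns=["y_true"])
y = data["y_true"]
forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
X_selected = X.loc[:, importances > importances.mean()]   # keep the above-average features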
Recursive feature elimination (RFE) selects features by recursively considering
smaller and smaller sets of features. First, the estimator is trained on the initial
set of features and weights are assigned to each of them. Then, the features whose
absolute weights are smallest are pruned from the current set of features. That
procedure is repeated recursively on the pruned set until the desired number of
features to select is eventually reached. In this module, RFE is performed in a
cross-validation loop to find the optimal number of features, and a support vector
classification estimator with a linear kernel is used. There are two
parameters required:
the step value and the fold value. Their definitions are listed below the input forms.
Here,
we use the default parameters (see Figure 11 and Figure 12).
Input: File 15
Output: File 17
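This module corresponds to scikit-learn's RFECV with a linear-kernel SVC. A minimal sketch follows; the step and fold values shown are illustrative stand-ins for the module's defaults, and the file/column names are hypothetical.

import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

data = pd.read_csv("file15.csv")                                  # hypothetical name for File 15
X = data.drop(columns=["y_true"])
y = data["y_true"]
selector = RFECV(estimator=SVC(kernel="linear"), step=1, cv=10)   # step value and fold value
selector.fit(X, y)
print("optimal number of features:", selector.n_features_)
X_selected = X.loc[:, selector.get_support()]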
To give an overall report for the whole modelling pipeline, a unique job ID
is needed and should be created in stage 2.
All the results from each step of modelling will be attached to this job ID. In this
module, there are 5 kinds of commonly used machine learning methods available:
"Random Forest", "Support Vector Machine", "Naïve Bayes", "K Neighbors" and
"Decision Tree". Each method requires several parameters for the calculation.
For those that need parameter optimization, a general grid search method is
used. Here, we take the "Random Forest" method as an example
and set the parameters as follows: 'n_estimators': 800, 'step': 1, 'end_feature': 31,
'start_feature': 1, 'cv': 10. The data tag is simply a label for
the file you uploaded; it is helpful when you check your data in the "My Data"
module. The model tag is a label for
the model you are currently establishing; it is helpful when you check your models in the "My Model"
module. Here, we type in "418selectrf" (see Figure 13).
The calculation will take a while, depending on the data and the method
selected. After clicking the "Submit" button,
you will be redirected to the result page (see Figure 14), where you can
check your job status. After the calculation is complete,
click the "Check Result" button again and the result
contents will be displayed on the page (see Figure 15 and Figure 16).
Since we use the "session" and "AJAX" techniques,
there will be no more "Time out (504 error)". You can check your job status at
any time on this page. You can even close your browser or shut down your
computer and check the results later by job ID in the "My Report" module.
This is very convenient for accomplishing a multi-step and time-consuming
task.
Input: File 18 (add the SMILES column to File 16 to get File 18) and File 6
Output: Results in the page.
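A rough scikit-learn analogue of the grid search run in this stage is sketched below: n_estimators is fixed at 800 and a feature-related parameter is scanned from 1 to 31 in steps of 1 with 10-fold cross-validation. Mapping 'start_feature'/'end_feature'/'step' to max_features is an assumption, and the file/column names are hypothetical.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = pd.read_csv("file18.csv").drop(columns=["SMILES"])     # hypothetical name for File 18
y = pd.read_csv("file6.csv")["y_true"]                     # hypothetical name for File 6
param_grid = {"max_features": list(range(1, 32, 1))}       # 1 .. 31, step 1 (assumed mapping)
grid = GridSearchCV(RandomForestClassifier(n_estimators=800), param_grid, cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)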
In the "Model Selection" stage, you have tried a series of methods and their
parameters and finally obtained a set of best parameters for your data.
In this stage, you should input the method and the best parameters. Then, a reliable
and desired model will be established.
The results will be displayed on the result page, and the model itself will be dumped
to files for use by the "Prediction" module (see Figure 17 and Figure 18).
Input: File 18 and File 6
Output: Results in the page.
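Conceptually, this stage refits the chosen method with the best parameters on the training data and persists the model for the "Prediction" stage. A minimal sketch under that assumption; the max_features value and all file names are illustrative.

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.read_csv("file18.csv").drop(columns=["SMILES"])     # hypothetical training features
y = pd.read_csv("file6.csv")["y_true"]                     # hypothetical training labels
model = RandomForestClassifier(n_estimators=800, max_features=10, random_state=0)
model.fit(X, y)
joblib.dump(model, "final_model.pkl")                      # saved for the prediction step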
In the "Model Building" stage, a reliable model has been established according to
your requirements and saved temporarily.
On the index page of this module, you will see a table containing the related
information from the "Model Building" stage.
Check the information and, if it is correct, upload the test set file to proceed.
Here, we add the SMILES column to the start of File 10 to get File 19.
Note that an automatic indexer will be employed to pick up
the right feature columns (features in X) of the uploaded file.
Click the "Submit" button to start the prediction (see Figure 19). On the
result page, an interactive table containing the prediction results will be displayed.
Several table tools are available for you to get what you want.
At the bottom, a "Download as csv" button lets you download the results (see
Figure 20).
Input: File 19
Output: File 20
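A minimal sketch of what the prediction step does conceptually: load the saved model, select the feature columns of the uploaded test table, and write out the predicted labels. The file names and the way feature columns are selected here are illustrative assumptions.

import joblib
import pandas as pd

model = joblib.load("final_model.pkl")                     # hypothetical model file name
test = pd.read_csv("file19.csv")                           # hypothetical name for File 19
X_test = test.drop(columns=["SMILES"])                     # ChemSAR's indexer does this automatically
pred = pd.DataFrame({"SMILES": test["SMILES"], "y_pred": model.predict(X_test)})
pred.to_csv("predictions.csv", index=False)                # analogous to File 20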
If you have a test set for the model, as in this example, another optional tool,
"Statistical Analysis", is available for assessing the performance of
the model on the external test set. After the prediction, add the "y_true" column
from File 8 to the end of File 20 to get File 21.
Upload File 21 and click the "Submit" button, and you will be redirected to the
result page (see Figure 21 and Figure 22).
Input: File 21
Output: Results in the page.
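A minimal sketch of the kind of external-validation statistics computed in this step, assuming File 21 holds the predicted and true labels in columns named "y_pred" and "y_true" (both names hypothetical):

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

results = pd.read_csv("file21.csv")                        # hypothetical name for File 21
y_true, y_pred = results["y_true"], results["y_pred"]
print("accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))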
A specific feature of ChemSAR is that it provides a complete report generation
system.
It retrieves the result data of each calculation and transforms them into an HTML page
and a PDF file for users.
After going through the whole modelling pipeline, you can go to the "My
Report" module to obtain a well-organized
report covering everything from "Feature Calculation" to "Prediction and
statistic" (see Figure 23). On the index page of this module,
all the job IDs that you have started are listed. A "Get a PDF" button
allows you to generate a PDF for off-line use.
A "Query" button is available for querying the
information about models created in other jobs.
This is also very helpful when you attempt to build models using different
methods and parameters, or
when you want to build more than one model on the same client at the same time (see
Figure 24).
This is a file system that allows users to view the files they uploaded in each step;
according to the tags, users can reuse these files conveniently.
On the index page, the meaning of each file name is explained. Once users upload a file, it
will be stored in the user space and the file name will be listed in the table.
The "Download" and "View" buttons allow users to download a file for reuse and to
view its content (see Figure 25).
This is a file system used to store all the models built in the "Model Building" stage.
Under the same job ID, users can make multiple attempts to get the best model,
but only the latest one from "Model Building" will be regarded as the final model and stored here!
If users want to build more than one model, please start a new job!
Users can start several jobs (max: 10) to build different models. If all 10 job IDs have been used up, users should
delete the browser cookies (sessionid from http://chemsar.scbdd.com) and come back as a new guest user, or
change to a new client
(see Figure 26, 27).
Click a picture to see a larger version, and click again to go back.