Available Analysis Software

Note: There are a variety of analysis tools available on the HMS DRSC website and some of these are also applicable to mammalian systems. Please look under the Online Tools section.

Analysis Software for Small Molecule and RNAi Data

1. Dotmatics Vortex is a data visualization software suite that enables screeners to graph very large data sets for the purposes of identifying outlying data points (e.g. hits), visualizing trends in data, and comparing data sets (replicates, plates, days, different screens, etc.). This software package has built-in data mining, statistical and calculation functions.

  • Utilized by ICCB-Longwood staff when examining pilot small molecule and RNAi data and producing final screen result data summaries.
  • Vortex is no longer available for screeners to use at ICCB-L. 
  • To request training for your own individual Vortex license, please schedule a brief learning session by contacting Jennifer Splaine.

2. Ingenuity Pathways Analysis is a biological pathway analysis tool that identifies relationships among genes in a list (e.g. hits in an siRNA screen).

  • Note: This web-based analysis tool is no longer available for use by ICCB-Longwood screeners at Harvard Medical School, and now requires purchasing an individual license.
  • IPA is available on Mac or PC, using Chrome, Internet Explorer, Firefox or Safari.

Analysis Software for Small Molecule (Chemistry) Data

1. BIOVIA Pipeline Pilot is useful for analyzing small molecule screening data. It can rapidly analyze large datasets with user-specified computational protocols that are assembled in a modular fashion. The application’s full functionality is available on only one workstation at the ICCB-Longwood screening facility and is typically used solely by ICCB-Longwood staff. There are a variety of protocols that have been developed for commonly requested analytical functions (see list below). If you are interested in having your small molecule screening data analyzed using one or more of these protocols, please contact David Wrobel.

Sample protocols

  • Calculate ALogP: Uses the atom-based method published by Ghose and Crippen to calculate the predicted octanol-water partition coefficient (LogP) and the molar refractivity (MR) for all compounds listed in an input Excel file based on plate and well coordinates.
  • Calculate Molecular Weights: Calculates molecular weights for all compounds listed in an input Excel file based on plate and well coordinates.
  • Cluster Molecules in an Excel File: Compounds from an Excel input file are partitioned into groups of structurally similar compounds (clusters) based on Tanimoto similarities.
  • Cluster Molecules in an SD File: Compounds from an SD input file are partitioned into groups of structurally similar compounds (clusters) based on Tanimoto similarities.
  • Collect Information from Compound Libraries: Collects compound structures and information from ICCB-Longwood small molecule libraries and outputs one file using plate and well names from an input file as identifiers.
  • Combine Excel Worksheets: Combines all Excel worksheets from one workbook to one worksheet.
  • Filter Screen Result Positives: Filters all wells designated as positive in multiple annotated screen result files.
  • Join Data Files: Using a data column common to two input files (the query file and file to match), outputs a file with all records that are present in both (based on the specified column or columns to match). A second output file includes all data in the file to be matched that did not match the query file.
  • Lipinski Filter: Filters input SD file(s) based on Lipinski’s Rule of Five.
  • Molecular Weight Filter: Filters compounds from an input SD file based on the selected molecular weight range.
  • Search Compound Libraries using SMILES Strings: Uses SMILES strings as input to search against the complete ICCB-Longwood compound library database.
  • SMILES Canonicalization: Converts an input list of SMILES strings to the canonical versions, which is useful when attempting to generate a standardized list. Canonical SMILES is a form of SMILES that is independent of how the molecule is drawn and can be used to compare whether two molecules are identical.
  • Structure Similarity Filter: Starting with a specific single compound query structure entered as a SMILES string, identifies structurally similar compounds in the active and retired ICCB-Longwood libraries database.
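
Several of the protocols above (clustering and the Structure Similarity Filter) rely on Tanimoto similarity between compound fingerprints. The following is a minimal Python sketch of that calculation, not the Pipeline Pilot implementation. It assumes each fingerprint is already available as a set of "on" bit positions; the compound IDs, bit sets, threshold and function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: fingerprints are modeled as sets of "on" bit
# positions. Real fingerprints would come from a cheminformatics toolkit.

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similar_compounds(query_fp, library, threshold=0.7):
    """Return (compound_id, score) pairs at or above threshold, best first."""
    scored = ((cid, tanimoto(query_fp, fp)) for cid, fp in library.items())
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy library keyed by made-up compound IDs.
library = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 9},
    "cmpd_C": {7, 8},
}
query = {1, 2, 3, 4}
matches = similar_compounds(query, library, threshold=0.5)
```

With this toy data, cmpd_A and cmpd_B pass the 0.5 threshold and cmpd_C does not, because it shares no bits with the query. A clustering protocol would apply the same pairwise score across the whole input file rather than against a single query.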

2. KNIME is an open-source program that can be used for data visualization and cheminformatics analysis. It contains many popular cheminformatics packages including RDKit and CDK. The user can create workflows to calculate physicochemical properties, perform similarity searching and perform compound clustering. It can read many input file types, including Excel, csv and sdf. KNIME has similar functionality to Pipeline Pilot.

3. DataWarrior is a free cheminformatics program for data visualization and analysis. The graphical user interface enables you to view your data as scatter plots, box plots and bar charts, as well as visualize compound structures from SMILES strings provided in the input. Input files can be csv, txt, ode or sdf. Several cheminformatics calculations can be conducted, including compound clustering, similarity searching, automatic SAR analysis and activity cliffs analysis.

4. SwissADME is an online tool that computes physicochemical descriptors, performs PAINS analysis, and calculates ADME and pharmacokinetic properties. The program takes compound SMILES strings as input.

5. ChemOffice software suite is used to manipulate compound structures. ChemOffice includes ChemDraw and ChemFinder. If your institution already holds an academic site license for this suite, you should be able to obtain it for free. ChemOffice is only available for Windows-based computers, but a comparable package (ChemDraw Ultra) is available for Macintosh.

6. SciFinder and Beilstein provide additional and highly useful search capabilities. If you have a PC Windows or Macintosh computer with a Harvard University campus network connection, and you are a current Harvard ID holder, you will be able to download SciFinder and Beilstein from the Harvard Chemistry and Chemical Biology Library Webpage. Although full use of these two programs requires significant chemical expertise, screeners will find it relatively simple to use them to search for commercially available compounds and papers reporting the biological activities of single compounds and their analogs.

7. ChemNavigator.com provides information on compounds and compound structures. Through a collaboration with ChemNavigator, investigators screening at ICCB-Longwood have access to the ChemNavigator database. Please contact David Wrobel for login and password information.

8. Hit2Lead.com supports searches of the ChemBridge collection based on structure, ID number, or compound name. The structure-based search function requires installing plug-ins from MDL and is currently compatible only with Windows. For more information, please see the ChemBridge library page.

9. NCI Developmental Therapeutics Program (DTP) provides an overview of the drug discovery and development branch of the National Cancer Institute.

10. A tutorial on the use of the PubChem database is available. We recommend that you look at “Course Materials” and click on “Essential Exercises” or “PowerTools Exercises” to access the tutorials; a pop-up window will assist you through the steps of each lesson.

11. Medicinal Chemistry Resources: If you are interested in medicinal chemistry support for analysis of top screening hits or for compound follow-up, please contact David Scott and The Medicinal Chemistry Core at Dana-Farber Cancer Institute. In addition, there are other medicinal chemistry groups in the Boston area that have worked with ICCB-Longwood screeners in the past. For more information about these potential resources, please contact Jennifer Smith.

Analysis Software for RNAi Screen Data

1. CARD – Comprehensive Analysis of RNAi Data was developed by Bhaskar Dutta and colleagues (Nature Communications, doi: 10.1038/ncomms10578) at the National Institute of Allergy and Infectious Diseases. It is a comprehensive web application for integrated analysis and interactive visualization of RNAi screening data. CARD combines both existing and novel algorithms for data pre-processing, reducing false positive hits through gene expression and off-target filtering, implementing network/pathway enrichment of high-confidence hits and predicting active miRNAs. ICCB-Longwood formatted siRNA screening data can be loaded into CARD; please contact Jennifer Smith or David Wrobel to upload your data.

2. Horizon Discovery Seed Analysis Tool can be used to identify siRNAs that may be acting via off-target mechanisms on genes that are true hits in an RNAi screen. The tool compares the sequences of siRNA seed regions in two different lists: 1) a list of siRNAs that score as hits in the primary RNAi screen; and 2) a list of siRNAs that do NOT score as hits in the screen. The program will flag siRNAs with seed regions enriched in the hit list relative to the non-hit list.

Note: This service is no longer available from Dharmacon: ICCB-Longwood staff will submit data to Dharmacon for analysis and will return results to screeners. To format your data to submit for seed tool analysis:

  • Preferred formats are txt, csv, or xls files showing one catalog number (or sense sequence) per line, with hits and non-hits in separate files or on separate worksheets.
  • The tool takes catalog numbers (pool or duplex) or 19mer sense sequences.
  • It is best to provide two lists: a list of hits and a list of non-hits. The list of non-hits must be at least as long as the list of hits. Ideally the non-hit list would be at least three to four times longer than the hit list.
  • The software will then randomly select a set of non-hits as large as the set of hits and examine the seed properties of this control set to give you a sense of the “background” level of seed frequencies due to things like sequence composition, algorithmic biases, etc.
  • Please specify the organism.
  • Contact David Wrobel or Jennifer Smith to submit lists to Dharmacon.
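
The comparison described in the list above can be sketched in Python as follows. This is an illustration of the general idea only, not the Dharmacon or Horizon Discovery algorithm. It assumes the seed region is positions 2 to 8 of the antisense (guide) strand, derived as the reverse complement of the 19mer sense sequence; the function names and example sequences are hypothetical.

```python
import random
from collections import Counter

# RNA complement table; DNA-style input (T) is converted to U first.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_from_sense(sense_19mer, start=2, end=8):
    """Seed of the antisense strand, positions start..end (1-based, inclusive).

    Assumption: seed = guide-strand positions 2-8.
    """
    seq = sense_19mer.upper().replace("T", "U")
    antisense = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return antisense[start - 1:end]


def seed_counts(sense_sequences):
    """Count how often each seed occurs in a list of sense sequences."""
    return Counter(seed_from_sense(s) for s in sense_sequences)


def compare_seeds(hits, non_hits, rng_seed=0):
    """Map each hit seed to (count in hits, count in a same-size non-hit sample)."""
    if len(non_hits) < len(hits):
        raise ValueError("non-hit list must be at least as long as the hit list")
    rng = random.Random(rng_seed)  # fixed seed so the sketch is repeatable
    control = rng.sample(non_hits, len(hits))
    hit_counts = seed_counts(hits)
    control_counts = seed_counts(control)
    return {seed: (n, control_counts.get(seed, 0)) for seed, n in hit_counts.items()}


# Made-up 19mer sense sequences, for illustration only.
hits = ["GCAUGCAUGCAUAAAAAAA", "C" * 11 + "UAAAAAAA"]
non_hits = ["G" * 19, "C" * 19, "GC" * 9 + "G"]
result = compare_seeds(hits, non_hits)
```

A seed that appears often among hits but rarely in the random non-hit sample is a candidate for an off-target effect. A longer non-hit list gives the random control sample a better chance of reflecting the true background seed frequencies.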

3. Online GESS: Genome-wide Enrichment of Seed Sequence matches (GESS) is a bioinformatics analysis algorithm that was developed by Frederic Sigoillot and Randy King (Nat Methods. Feb 19;9(4):363-6.) to identify potential off-targeted transcripts based upon analysis of primary RNAi screening data. An online version of GESS was developed by the Drosophila RNAi Screening Core here at Harvard Medical School.

4. Haystack Analysis Web Server was developed by Eugen Buehler and colleagues (Sci. Rep. 2, 428; DOI:10.1038/srep00428.) to analyze results from RNAi screens that are normally distributed. The analysis is based upon off-target effects. Note that this analysis tool appears to no longer be available from NCATS.

5. C911 Calculator is an experimental strategy to determine whether an siRNA duplex is producing the desired phenotype because of on-target or miRNA-like off-target effects (PLoS ONE 7(12): e51942.). Note that this analysis tool appears to no longer be available from NCATS.
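
Because the online calculator may be unavailable, the following Python sketch shows the sequence manipulation behind a C911 control as we understand it from the cited PLoS ONE paper: bases 9 through 11 of the siRNA are replaced with their complements, leaving the seed region intact. This is an assumption-labeled illustration, not the NCATS implementation; the function name and example sequence are hypothetical, and any design should be checked against the original publication.

```python
# RNA complement lookup; DNA-style input (T) is converted to U first.
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def c911_variant(sequence, first=9, last=11):
    """Return the sequence with bases first..last (1-based) complemented.

    Assumption: a C911 control complements bases 9-11 and leaves all
    other positions, including the seed region, unchanged.
    """
    seq = sequence.upper().replace("T", "U")
    if len(seq) < last:
        raise ValueError("sequence is shorter than the region to complement")
    middle = "".join(RNA_COMPLEMENT[base] for base in seq[first - 1:last])
    return seq[:first - 1] + middle + seq[last:]


# Made-up 19mer, for illustration only.
original = "GCAUGCAUGCAUAAAAAAA"
control = c911_variant(original)
```

For a 19mer, positions 9 to 11 are the central three bases, so the same positions are altered whether they are counted on the sense or the antisense strand.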

Please contact Jennifer Smith or David Wrobel if you have questions about how to submit a list to or interpret results from any of the above analysis tools.