Databases & Protocols

Bibliographic Databases

Bioinformatics & Genomics Databases

Cancer Databases

Chemical & Biochemical Databases

Neuroscience Databases

Plant Biology Databases

QB Databases


The following list of commercial and open access databases allows you to search for published papers by author, title, keyword, and journal. More specialized databases are listed under CSHL’s Five Disciplines.

Academic Search (Microsoft)

Microsoft Academic Search generates web pages of ranked objects for the academic world. Users can use these pages to discover influential papers, authors, conferences, journals, keywords, and organizations within their field.

Search results are sorted based on two factors: their relevance to the query and their global importance, which is calculated from their relationships with other objects.

BIOBASE Knowledge Library (BKL) Proteome

BKL PROTEOME™ is a systems biology database and analysis tool built on manually-curated protein, disease, drug, pathway, and model organism details from the PubMed literature. This powerful query system, with specialized tools for gene set analysis and pathway visualization, allows scientists to quickly find answers to questions relevant to their research.


Resources for the life sciences information community

Collection of Computer Science Bibliographies

This is a collection of bibliographies of scientific literature in computer science from various sources, covering most aspects of computer science. The bibliographies are updated weekly from their original locations such that you’ll always find the most recent versions here.

The collection currently contains more than 7 million references (mostly to journal articles, conference papers and technical reports), clustered in about 1500 bibliographies, and consists of more than 2.3 GBytes (530 MB gzipped) of BibTeX entries. More than 600,000 references contain cross-references to citing or cited publications.

More than 1 million references contain URLs to an online version of the paper. Abstracts are available for more than 800,000 entries. There are more than 2000 links to other sites carrying bibliographic information.

Current Contents Connect, Life Sciences (1998 – present)

Complete tables of contents and bibliographic information of journals and books from WEB OF KNOWLEDGE. Includes evaluated Web sites and pre-published electronic journal articles. Publisher: Thomson Reuters.

Ebsco Databases – from NovelNY (New York Online Virtual Electronic Library)

Full-text magazines, General Science Collection, Searchasaurus, EBSCOhost Espanol, etc. For Remote Access instructions, go to Remote Ebsco Databases.


Covers over 750 periodicals and nearly half a million documents. Full-text documents are available from 1993 onward.

Faculty Opinions

Highlights and reviews the most interesting papers published in the biological sciences, based on the recommendations of a faculty of well over 1000 selected leading researchers.

First Search

OCLC: Online Computer Library Center. 14 databases. Includes extensive biomedical content and WorldCat, a worldwide library catalog of books and other materials.

Gale Databases – from NovelNY (New York Online Virtual Electronic Library)

Business and Company Resource Center, including Business ASAP, New York State and National Newspaper Indexes, Health and Wellness Resource Center.

GenBank – From NCBI

overview, submit sequences, submit genomes, sample record, GenBank divisions, statistics, release notes, international collaboration, FTP GenBank

Google Scholar

Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: articles, theses, books, abstracts and court opinions, from academic publishers, professional societies, online repositories, universities and other web sites. Google Scholar helps you find relevant work across the world of scholarly research.

Henry Stewart Talks

A collection of online seminars by world-leading scientists, covering basic to advanced topics in the biomedical and life sciences.

Journal Citation Reports (JCR)

Journal performance metrics, including Impact Factor


Provides searchable full-text of historical runs of important scholarly journals in the humanities, arts, sciences, ecology, and business.

Knovel Scientific and Engineering Online References

A collection of over 600 handbooks, dictionaries and proceedings in technical disciplines. Full text is searchable by keywords and numeric property values. Contains interactive tables, graphs, and equations. Data from tables can be manipulated and then imported into an Excel spreadsheet.

Literature Databases – From NCBI

PubMed, PubMedCentral, Journals, OMIM, Books, Citation Matcher

LocatorPlus – From NCBI

Advanced searching through all of NLM’s book, journal and audiovisual titles

Molecular Databases – From NCBI

nucleotides, proteins, structures, genes, gene expression, taxonomy


Cross-database search of biomedical and life sciences literature and data. Coverage: NCBI’s 37 databases, from PubMed to genes to proteins

NCBI – Literature

Bibliographic descriptions and full text of biomedical and life sciences literature. Coverage: 10 databases, ranging from Bookshelf to PubMed Central.


Bibliographic descriptions of biomedical and life sciences literature with links to full-text content from PubMed Central and publisher websites. Coverage: More than 21 million citations for biomedical literature from MEDLINE, life science journals, and online books.

PubMed Central

The U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature.


Electronic access to journals published by Elsevier Science and Academic Press. Coverage: varies; includes current subscriptions and some back files.


Scite is an award-winning platform for discovering and understanding scientific articles via Smart Citations. Smart Citations allows users to see how a scientific paper has been cited by providing the context of the citation and a classification describing whether it provides supporting or contrasting evidence for the cited claim.


Scopus (from the Reed Elsevier Group) offers more coverage of scientific, technical, medical and social science literature than any other database. Also indexes Patents and Websites. Coverage: abstracts back to 1966.

Web of Science

Access the world’s leading scholarly literature in the sciences, social sciences, arts, and humanities and examine proceedings of international conferences, symposia, seminars, colloquia, workshops, and conventions. Navigate with cited reference searching and Author Finder. Create a visual representation of citation relationships with Citation Mapping. Use the Analyze Tool to identify trends and patterns. Edition: Science Citation Index Expanded (1955-present) Publisher: Thomson Reuters


Thousands of libraries are represented and searchable in this one catalog.

WorldCat Dissertations

All dissertations, theses and published material based on theses cataloged by OCLC members


The Biology and Genome of C. elegans

1000 Genomes Project

The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person’s genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person’s DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
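
The relationship between sequencing depth and how much of the genome gets covered can be illustrated with a simple model. The sketch below assumes (this is an idealization, not something stated by the project) that sequenced pieces land uniformly at random, so the number of pieces covering a given position follows a Poisson distribution whose mean is the coverage depth. The function name `fraction_covered` is illustrative only.

```python
import math


def fraction_covered(depth: float, min_reads: int = 1) -> float:
    """Expected fraction of genome positions covered by at least `min_reads` pieces.

    Assumes per-position coverage is Poisson-distributed with mean `depth`.
    Real coverage is less uniform, so treat the result as an optimistic estimate.
    """
    # Probability that a position is covered by fewer than `min_reads` pieces.
    p_below = sum(
        math.exp(-depth) * depth**k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


for depth in (1, 4, 28):
    print(
        f"{depth:>2}X: at least 1 piece {fraction_covered(depth):.4f}, "
        f"at least 2 pieces {fraction_covered(depth, 2):.4f}"
    )
```

Under this model, roughly 63% of positions are covered at 1X, which matches the statement that much of the sequence is missed at low depth, while at 28X the uncovered fraction becomes negligible.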


AceDB is a genome database system developed since 1989 primarily by Jean Thierry-Mieg (CNRS, Montpellier) and Richard Durbin (Sanger Institute). It provides a custom database kernel, with a non-standard data model designed specifically for handling scientific data flexibly, and a graphical user interface with many specific displays and tools for genomic data. AceDB is used both for managing data within genome projects, and for making genomic data available to the wider scientific community.

AceDB was originally developed for the C. elegans genome project, from which its name was derived: A C. elegans DataBase. However, the tools in it have been generalized to be much more flexible and the same software is now used for many different genomic databases from bacteria to fungi to plants to man. It is also increasingly used for databases with non-biological content.

ACM Digital Library

The ACM Digital Library (DL) is the most comprehensive collection of full-text articles and bibliographic records in existence today covering the fields of computing and information technology. The full-text database includes the complete collection of ACM’s publications, including journals, conference proceedings, magazines, newsletters, and multimedia titles and currently consists of:

  • 407,367 Full-text articles
  • 2.0+ Million Pages of full-text articles
  • 18,000+ New full-text articles added each year
  • 44+ High Impact Journals with 2-3 new journals being launched each year
  • 275+ Conference Proceedings Titles added each year
  • 2,000+ Proceedings Volumes
  • 8 Magazines (including the flagship Communications of the ACM, the most heavily cited publication in the field of computing according to Thomson-Reuters)
  • 37 Technical Newsletters from ACM’s Special Interest Groups (SIGs)
  • 6,500+ Video files
  • 594 Audio files

In addition to the full-text database, the ACM Digital Library is heavily integrated with and includes unrestricted access to the Guide to Computing Literature bibliography.

The ACM Digital Library includes reference linking through CrossRef, integration with the ACM Computing Reviews database, index terms using ACM’s 2012 Computing Classification Scheme (CCS), alerting and TOC services, and all export formats including BibTeX, EndNote, and ACM Ref, as well as OpenURL compliance, and COUNTER III and SUSHI compliant usage statistics.

Allen Brain Atlas

The Allen Brain Atlas resources are a growing collection of online public resources integrating extensive gene expression and neuroanatomical data, complete with a novel suite of search and viewing tools. This portal gives you access to each of these resources by clicking on the button for a particular project or by clicking the project from the banner tab or drop-down menu.


ARTS: Accurate Recognition of Transcription Starts in Human

(now at cBio@mskcc)

Started in August 1991, arXiv is a highly automated electronic archive and distribution server for research articles. Covered areas include physics, mathematics, computer science, nonlinear sciences, quantitative biology and statistics. arXiv is maintained and operated by the Cornell University Library with guidance from the arXiv Scientific Advisory Board and the arXiv Member Advisory Board, and with the help of numerous subject moderators.

Berkeley Drosophila Genome Project (BDGP) at Lawrence Berkeley National Laboratory

The Berkeley Drosophila Genome Project (BDGP) is a consortium of the Drosophila Genome Center, funded by the National Human Genome Research Institute, National Cancer Institute, and Howard Hughes Medical Institute, through its support of work in the Gerald Rubin, Allan Spradling, Roger Hoskins, Hugo Bellen, Susan Celniker, and Gary Karpen laboratories.

The goals of the Drosophila Genome Center are to finish the sequence of the euchromatic genome of Drosophila melanogaster to high quality and to generate and maintain biological annotations of this sequence. In addition to genomic sequencing, the BDGP is 1) producing gene disruptions using P element-mediated mutagenesis on a scale unprecedented in metazoans; 2) characterizing the sequence and expression of cDNAs; and 3) developing informatics tools that support the experimental process, identify features of DNA sequence, and allow us to present up-to-date information about the annotated sequence to the research community.

BIOBASE Knowledge Library (BKL) Proteome

C. elegans Gene Expression Consortium

The objective of this project is to define the RNA expression profiles in specific tissues, cells, and developmental stages of C. elegans. Two complementary approaches are being applied: serial analysis of gene expression (SAGE), and the construction of promoter::GFP fusions for in vivo analysis of gene expression.

SAGE is a sensitive and specific method for obtaining qualitative and quantitative information on expressed RNAs. SAGE will also allow us to identify non-protein-coding genes, and provide insight into alternatively spliced mRNA isoforms and their relative abundance between tissues.

We are examining total mRNA populations in all developmental stages, both in whole worms and in specific cells and tissues. We have generated 17 SAGE libraries, which include all developmental stages, mutation-specific populations, and specific tissues and cells, totalling approximately 1.8 million observed tags. Tissue- and cell-specific libraries were generated from FACS-sorted cells marked by expression of specific promoter::GFP fusions. To date, we have SAGE libraries for purified embryonic muscle, gut, and a subset of neurons.

Promoter::GFP fusions

Monitoring in vivo expression of the fusion constructs in transgenic worms allows the determination of the developmental stage, tissue, and in some cases, the cells where a particular gene is expressed. Our goal is to build promoter::GFP fusion constructs for C. elegans genes that have human orthologues. Of the over 5000 genes that fall into this category, 2000 are being targeted by the C. elegans Gene Knockout Consortium. Fusion constructs are being created for the same set of genes, with a focus on genes expressed in muscle and nerve tissues. When coupled with SAGE and knockout data, this will provide valuable and more complete expression profiles for cells, tissues, and developmental stages.

Cancer Gene Census

All cancers arise as a result of the acquisition of a series of fixed DNA sequence abnormalities, mutations, many of which ultimately confer a growth advantage upon the cells in which they have occurred. There is a vast amount of information available in the published scientific literature about these changes. COSMIC is designed to store and display somatic mutation information and related details and contains information relating to human cancers.

Types of data

There are two types of data in COSMIC: expert manual curation data and systematic screen data. It is useful to understand the differences between these data types and to use them appropriately.

Expert curation data

  1. Manually input from peer reviewed publications by COSMIC expert curators
  2. Consists of comprehensive literature curation of selected Census genes at release, followed by subsequent updates (Cancer Gene Census)
  3. Includes additional data points relevant to each disease and publication
  4. Provides accurate frequency data as mutation negative samples are specified
  5. Also called non-systematic or targeted screen data

Genome-wide screen data

  1. Uploaded from publications reporting large scale genome screening data or imported from other databases such as TCGA and ICGC
  2. Provides unbiased molecular profiling of diseases while covering the whole genome
  3. Provides objective frequency data by interpreting non mutant genes across each genome
  4. Facilitates finding novel driver genes in cancer

Cancer Genome Atlas

There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrollably. Identifying the changes in each cancer’s complete set of DNA (its genome) and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.

The Cancer Genome Atlas (TCGA) began as a three-year pilot in 2006 with an investment of $50 million each from the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI). The TCGA pilot project confirmed that an atlas of changes could be created for specific cancer types. It also showed that a national network of research and technology teams working on distinct but related projects could pool the results of their efforts, create an economy of scale and develop an infrastructure for making the data publicly accessible. Importantly, it proved that making the data freely available would enable researchers anywhere around the world to make and validate important discoveries. The success of the pilot led the National Institutes of Health to commit major resources to TCGA to collect and characterize more than 20 additional tumor types.

Each cancer will undergo comprehensive genomic characterization and analysis. The comprehensive data that have been generated by TCGA’s network approach are freely available and widely used by the cancer community through the TCGA Data Portal and the Cancer Genomics Hub (CGHub).

Learn more about the components of the TCGA Research Network by selecting a link below:

Biospecimen Core Resource (BCR)  – Tissue samples are carefully cataloged, processed, checked for quality and stored, complete with important medical information about the patient.

Genome Characterization Centers (GCCs) – Several technologies will be used to analyze genomic changes involved in cancer. The genomic changes that are identified will be further studied by the Genome Sequencing Centers.

Genome Sequencing Centers (GSCs) – High-throughput Genome Sequencing Centers will identify the changes in DNA sequences that are associated with specific types of cancer.

Proteome Characterization Centers (PCCs) – The centers, a component of NCI’s Clinical Proteomic Tumor Analysis Consortium, will ascertain and analyze the total proteomic content of a subset of TCGA samples.

Data Coordinating Center (DCC) – The information that is generated by TCGA will be centrally managed at the DCC and entered into the TCGA Data Portal and Cancer Genomics Hub as it becomes available. Centralization of data facilitates data transfer between the network and the research community, and makes data analysis more efficient. The DCC manages the TCGA Data Portal. 

Cancer Genomics Hub (CGHub) – Lower level sequence data will be deposited into a secure repository. This database stores cancer genome sequences and alignments.

Genome Data Analysis Centers (GDACs) – Immense amounts of data from array and second-generation sequencing technologies must be integrated across thousands of samples. These centers will provide novel informatics tools to the entire research community to facilitate broader use of TCGA data.

Cancer Genome Project

CoCoMac 2

The macaque macroconnectivity database CoCoMac was initiated and built up by Prof. Rolf Kötter, first at the C. & O. Vogt Institute of Brain Research at the Heinrich Heine University in Düsseldorf, later at the Donders Institute for Brain, Cognition and Behaviour at the Radboud University Nijmegen, ending with a brief period at the Jülich Research Institute. While working on a major overhaul of the database and connectivity mapping engine, Rolf was struck by a tumor, which took his life after three years of battling the disease. He passed away on June 9th, 2010.

The ongoing work on the database has been adopted by the German INCF node (G-Node) and the Computational and Systems Neuroscience group of the Jülich Research Institute. The new database engine features an extensive search wizard and interactive browser. It also powers the Scalable Brain Atlas visual connectivity tool. A web-based data entry system is under development. The contact person for current developments is Dr. Rembrandt Bakker.

Cold Spring Harbor Mammalian Promoter Database

In the post-genome era, characterization of gene regulation networks has become an important part of genomic research. To succeed in such studies in any organism, a high-quality and comprehensive database of genes and their promoters, transcription factor binding sites, and other cis-regulatory elements is much desired if not a must.

The Cold Spring Harbor Laboratory mammalian promoter database (CSHLmpd) uses all known transcripts, integrated with predicted transcripts, to construct gene sets for the human, mouse, and rat genomes. For promoter information, we collected known promoters from multiple resources, together with predicted ones. These promoters were mapped to the genome and linked to related genes. We also compared promoters of orthologous gene groups to detect sequence conservation in promoter regions.

We expect CSHLmpd to be helpful for research in gene regulation networks by providing guidance for experimental studies such as DNA microarray and chromatin IP. It will also facilitate the building of a foundation upon which we expand our insights into the structure of the mouse genome through continued data collection, intelligent data analysis and integration.

Copy Number Variation Project

Genetic diseases are caused by mutations in DNA sequences. The Copy Number Variation (CNV) Project investigates the impact on human health of CNVs – gains and losses of large chunks of DNA sequence consisting of between ten thousand and five million letters. We already know that many inherited genetic diseases result from structural mutations or CNVs; we also know that there are Copy Number Variants that protect against HIV infection and malaria. The contribution of CNV to the common, complex diseases, such as diabetes and heart disease, is currently less well understood.

Database of Genomic Variants

A curated catalogue of human genomic structural variation

The objective of the Database of Genomic Variants is to provide a comprehensive summary of structural variation in the human genome. We define structural variation as genomic alterations that involve segments of DNA that are larger than 50 bp. The database contains only structural variation identified in healthy control samples.

The Database of Genomic Variants provides a useful catalog of control data for studies aiming to correlate genomic variation with phenotypic data. The database is continuously updated with new data from peer-reviewed research studies. We always welcome suggestions and comments regarding the database from the research community.

For data sets where the variation calls are reported at a sample-by-sample level, we merge calls with similar boundaries across the sample set. Only variants of the same type (e.g. CNVs, inversions) are merged, and gains and losses are merged separately. In addition, if several different platforms or approaches are used within the same study, these datasets are merged separately. Sample-level calls that overlap by >= 70% are merged in this process.
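
The 70% overlap rule can be sketched in a few lines. The database does not spell out its exact procedure here, so the code below makes two loudly labeled assumptions: overlap is measured reciprocally (the shared span must be at least 70% of each call), and calls are merged greedily in coordinate order into a region spanning both. The names `reciprocal_overlap` and `merge_calls` are hypothetical.

```python
from typing import List, Tuple

# A call is (start, end) in half-open coordinates on a single chromosome.
# Calls are assumed to have positive length and to share one variant type.
Call = Tuple[int, int]


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Return the smaller of the two overlap fractions between calls a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def merge_calls(calls: List[Call], threshold: float = 0.70) -> List[Call]:
    """Greedily merge sorted calls whose reciprocal overlap with the most
    recent merged region meets the threshold (an assumed strategy)."""
    merged: List[Call] = []
    for call in sorted(calls):
        if merged and reciprocal_overlap(merged[-1], call) >= threshold:
            last = merged[-1]
            merged[-1] = (min(last[0], call[0]), max(last[1], call[1]))
        else:
            merged.append(call)
    return merged


# Hypothetical loss calls from several samples; gains would be merged separately.
losses = [(1000, 2000), (1050, 2100), (1900, 3500), (5000, 6000)]
print(merge_calls(losses))
```

In this made-up input, the first two calls share most of their length and collapse into one region, while the third overlaps the merged region only slightly and stays separate.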

Database of Transcriptional Start Sites

To support transcriptional regulation studies, we have constructed the DBTSS (DataBase of Transcriptional Start Sites), which represents exact positions of transcriptional start sites (TSSs) in the genome based on our unique experimentally validated TSS sequencing method, TSS-seq.

This database includes TSS data covering a major part of human adult and embryonic tissues. DBTSS now contains 491 million TSS tag sequences collected from a total of 20 tissues and 7 cell cultures. We also integrated our newly generated RNA-seq data of subcellular-fractionated RNAs and ChIP-seq data of histone modifications, RNA polymerase II and several transcriptional regulatory factors in cultured cell lines. We also included recently accumulating external epigenomic data, such as the chromatin map of the ENCODE project.

In this update, we further associated this TSS information with public and original SNV data, in order to identify single nucleotide variations (SNVs) in the regulatory regions.

It is believed that single nucleotide variations (SNVs) in the transcriptional regulatory regions are responsible for many human diseases, including cancers. However, it remains difficult to distinguish functionally relevant SNVs from those having no explicit biological consequences. In this version of DBTSS, we attempt to associate SNVs with the omics information of the surrounding regions. We used SNVs which we identified from genomic analyses of various types of cancers, including somatic mutations of 100 lung adenocarcinomas and lung small cell carcinomas. For germline variations, we used SNVs in dbSNP as well as our unique dataset of variations in 1000 Japanese individuals. We integrated this SNV information with our original datasets of TSS-seq, RNA-seq, ChIP-seq of representative histone modifications and bisulfite sequencing of cytosine methylation of DNA. In particular, we present multi-omics data of 26 lung adenocarcinoma cell lines for which TSS-seq, RNA-seq, ChIP-seq and BS-seq together with whole genome sequencing are collected from the same materials. We further connected the multi-omics data of model organisms by genome-genome alignment. We provide a unique data resource to investigate what genomic features are observed at particular genomic coordinates in a wide variety of samples.

These data can be browsed in our new viewer, which also supports versatile user search conditions. We believe the new DBTSS is helpful for understanding the biological consequences of the massively identified TSSs and for identifying human genetic variations which are associated with disordered transcriptional regulation.


dbEST (Nature Genetics 4:332-3;1993) is a division of GenBank that contains sequence data and other information on “single-pass” cDNA sequences, or “Expressed Sequence Tags”, from a number of organisms.

ENCODE Project

The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence. The project started with two components – a pilot phase and a technology development phase.

The pilot phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence (See: ENCODE Pilot Project). The conclusions from this pilot project were published in June 2007 in Nature and Genome Research. The findings highlighted the success of the project to identify and characterize functional elements in the human genome. The technology development phase also has been a success with the promotion of several new technologies to generate high throughput data on functional elements.

With the success of the initial phases of the ENCODE Project, NHGRI funded new awards in September 2007 to scale the ENCODE Project to a production phase on the entire genome along with additional pilot-scale studies. Like the pilot project, the ENCODE production effort is organized as an open consortium and includes investigators with diverse backgrounds and expertise in the production and analysis of data (See: ENCODE Participants and Projects). This production phase also includes a Data Coordination Center to track, store and display ENCODE data along with a Data Analysis Center to assist in integrated analyses of the data. All data generated by ENCODE participants will be rapidly released into public databases (See: Accessing ENCODE Data) and available through the project’s Data Coordination Center.


The Ensembl project was started in 1999, some years before the draft human genome was completed. Even at that early stage it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data. The goal of Ensembl was therefore to automatically annotate the genome, integrate this annotation with other available biological data and make all this publicly available via the web. Since the website’s launch in July 2000, many more genomes have been added to Ensembl and the range of available data has also expanded to include comparative genomics, variation and regulatory data.

The number of people involved in the project has also steadily increased. Currently, the Ensembl group consists of between 40 and 50 people, divided into a number of teams. The Genebuild team creates the gene sets for the various species. The result of their work is stored in the core databases, which are taken care of by the Software team. This team also develops and maintains the BioMart data mining tool. The Compara, Variation and Regulation teams are responsible for the comparative, variation and regulatory data, respectively. The Web team makes sure that all data are presented on the website in a clear and user-friendly way. Finally, the Outreach team answers questions from users and gives workshops worldwide about the use of Ensembl. The Ensembl project is headed by Paul Flicek and receives input from an independent scientific advisory board.

Ensembl is a joint project between the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute (WTSI). Both institutes are located on the Wellcome Trust Genome Campus in Hinxton, south of the city of Cambridge, United Kingdom.


The Gene database provides detailed information for known and predicted genes defined by nucleotide sequence or map position. Currently, Gene contains more than 14 million entries and includes data from all major taxonomic groups. Each record in the database corresponds to a single gene and is derived from processing by the NCBI Reference Sequence and genome annotation groups. Gene data can be accessed on the web through the Gene home page, programmatically through the Entrez Programming Utilities, or by file transfer through its FTP site.
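
As a rough illustration of the programmatic route, the sketch below builds an Entrez Programming Utilities `esearch` request against the `gene` database and parses the JSON reply. The endpoint and the `db`, `term`, `retmode` and `retmax` parameters are standard E-utilities usage; the helper names and the example query are illustrative assumptions, and the network call is wrapped in a function that is not invoked here.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_search_url(term: str, retmax: int = 5) -> str:
    """Build an esearch URL that queries the Gene database for `term`."""
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(payload: str) -> List[str]:
    """Extract the list of Gene IDs from an esearch JSON response body."""
    return json.loads(payload)["esearchresult"]["idlist"]


def search_gene_ids(term: str, retmax: int = 5) -> List[str]:
    """Run the search over the network (not called in this sketch)."""
    with urlopen(gene_search_url(term, retmax)) as response:
        return parse_id_list(response.read().decode("utf-8"))


# Example query (hypothetical choice of gene): human BRCA1 records.
url = gene_search_url("BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Calling `search_gene_ids` with the same term would return the matching Gene IDs as strings; NCBI asks heavy users to rate-limit requests and supply an API key, which this sketch omits.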

EPD – Eukaryotic Promoter Database

This resource allows the access to several databases of experimentally validated promoters: EPD and EPDnew databases. They differ by the validation technique used and the coverage. EPD is a collection of eukaryotic promoters derived from published articles. Instead, the EPDnew databases (HT-EPD) are the result of merging EPD promoters whith in-house analysis of promoter-specific high-throughput data for selected organisms only. This process gives EPDnew high precision and high coverage.

The Eukaryotic Promoter Database is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis. This database contains 4806 promoters from several species.

EPDnew is a new collection of experimentally validated promoters in human, mouse, D. melanogaster and zebrafish genomes. Evidence comes from TSS mapping in high-throughput experiments such as CAGE and Oligocapping. ChIP-seq experiments on H2AZ, H3K4me3, Pol-II and DNA methylation are also taken into account during the analysis. The resulting database contains 23360 promoters for the human (H. sapiens) collection, 21239 promoters for the mouse (M. musculus) collection, 15073 promoters for the D. melanogaster collection, 10728 promoters for the zebrafish (D. rerio) collection, 7120 promoters for the worm (C. elegans) collection and 10229 promoters for the A. thaliana collection.

European Conditional Mouse Mutagenesis Programme

Fantom Functional Annotation of Mouse

Fantom Functional Annotation of the Mammalian Genome




Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
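
To make the idea concrete, here is a minimal Python sketch of a running-sum enrichment score. It is a simplified, unweighted statistic in the style of a Kolmogorov-Smirnov test, not the full GSEA procedure, which also weights genes by their correlation with the phenotype and estimates significance by permutation. The function name and the toy gene labels are assumptions made for illustration.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified sketch).

    ranked_genes: genes ordered from most to least associated with one state.
    gene_set: the a priori defined set of genes to test.
    Returns the running-sum value with the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up at genes in the set, step down at all other genes.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]  # toy labels
    print(enrichment_score(ranking, {"g1", "g2"}))  # set concentrated at the top
    print(enrichment_score(ranking, {"g5", "g6"}))  # set concentrated at the bottom
```

A score near +1 indicates that the set clusters at the top of the ranking, a score near -1 that it clusters at the bottom, and a score near 0 that it is spread throughout.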

NCBI Structure Group

NIH Knockout Mouse Project


Oryza Map Alignment Project





Repbase is a database of prototypic sequences representing repetitive DNA from different eukaryotic species. Repbase is being used in genome sequencing projects worldwide as a reference collection for masking and annotation of repetitive DNA (e.g. by RepeatMasker or CENSOR).

Rice Annotation Project (RAP) Database



TCat: The Catalog of Tissue-Specific Regulatory Motifs


UCSC Genome Browser


OBRC: Online Bioinformatics Resources Collection

Cancer Gene Census

The Cancer Gene Census is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer. The census is not static but is updated regularly as needed. In particular we are grateful to Felix Mitelman and his colleagues for providing information on more genes involved in uncommon translocations in leukaemias and lymphomas. Currently, more than 1% of all human genes are implicated via mutation in cancer. Of these, approximately 90% have somatic mutations in cancer, 20% bear germline mutations that predispose to cancer and 10% show both somatic and germline mutations.

Cancer Genome Atlas

About Cancer Genomics provides information to educate readers about cancer genomics, the importance of tissue samples in cancer genomics and how understanding cancer genomics is changing the way we approach cancer diagnosis and treatment.

Cancer Genome Project

The Cancer Genome Project is using the human genome sequence and high-throughput mutation detection techniques to identify somatically acquired sequence variants/mutations and hence identify genes critical in the development of human cancers. This initiative will ultimately provide the paradigm for the detection of germline mutations in non-neoplastic human genetic diseases through genome-wide mutation detection approaches.

Epistemic AI

Like an AI-enabled research assistant, Knowledge Maps gather documents and data according to explicit criteria, enabling the user to find connections across the biomedical universe. The Knowledge Map is the core of our platform and satisfies many use cases, including basic research, translational research, literature review, and hypothesis generation.

This tool can identify pathways that are enriched by and overlap with one or more genes, and provides species-specific analysis and statistics across multiple databases. The added ability to connect gene variants (SNVs, indels, deletions, CNAs) with known cancer biomarkers allows for gene enrichment analysis, pathway identification, variant analysis, and clinical interpretation.


Reaxys is a unique web-based chemistry database consisting of deeply excerpted compounds and related factual properties, reaction and synthesis information as well as bibliographic data, navigated and displayed via an actionable interface.


SciFinder, produced by Chemical Abstracts Service (CAS), is the most comprehensive database for the chemical literature. It is searchable by topic, by author, and by substance name or CAS Registry Number, or by using the editor to draw chemical structures, substructures, or reactions.


via USDA – provides citations to agricultural literature


The broad mission of ChromDB is to display, annotate, and curate sequences of two broad functional classes of biologically important proteins: chromatin-associated proteins (CAPs) and RNA interference-associated proteins. Plant proteins are the major focus of the work supported by The Plant Genome Research Program (PGRP) of the National Science Foundation. Our intent is to produce intensively curated sequence information and make it available to the research and teaching community in support of comparative analyses toward understanding the chromatin proteome in plants, especially in important crop species. In order to do a comparative analysis, it is necessary to include non-plant proteins in the database. Non-plant genes are not curated to the degree carried out for plants; to automate the process of data import, our non-plant genes are taken from the RefSeq database of NCBI. We reason that the inclusion of non-plant model organisms will broaden the relevance and usefulness of ChromDB to the entire chromatin community and will provide a more complete data set for phylogenetic analyses in support of the evolution of the plant chromatin proteome.

Gramene Project

Extensive research over the past two decades has shown there is a remarkably consistent conservation of gene order within large segments of linkage groups in agriculturally important grasses such as rice, maize, sorghum, barley, oats, wheat, and rye. Grass genomes are substantially colinear at both large and short scales with each other, opening the possibility of using syntenic relationships to rapidly isolate and characterize homologues in maize, wheat, barley and sorghum.

As an information resource, Gramene’s purpose is to provide added value to data sets available within the public sector, which will facilitate researchers’ ability to understand the grass genomes and take advantage of genomic sequence known in one species for identifying and understanding corresponding genes, pathways and phenotypes in other grass species. This is achieved by building automated and curated relationships between cereals for both sequence and biology. The automated and curated relationships are queried and displayed using controlled vocabularies and web-based displays. The controlled vocabularies (ontologies) currently being used include Gene ontology, Plant ontology, Trait ontology, Environment ontology and Gramene Taxonomy ontology. The web-based displays for phenotypes include the Genes and Quantitative Trait Loci (QTL) modules. Sequence-based relationships are displayed in the Genomes module using the genome browser adapted from Ensembl, in the Maps module using the comparative map viewer (CMap) from GMOD, and in the Proteins module displays. BLAST is used to search for similar sequences. Literature supporting all the above data is organized in the Literature database.

MAGI (Maize Assembled Genomic Island)

The MAGI website summarizes some of our investigations of the maize genome.

We have assembled gene-enriched (MF and HC; Whitelaw et al., 2003; Palmer et al., 2003) and random Whole Genome Shotgun (WGS) GSSs (Genome Survey Sequences) of maize and sorghum into MAGIs (Maize Assembled Genomic Islands) and SAMIs (Sorghum Assembled genoMic Islands), respectively. Based on computational and biological quality assessments, it appears that a very high percentage of genic MAGIs and SAMIs accurately reflect the structures of the maize (Fu et al., 2005) and sorghum genomes. We have similarly assembled maize ESTs into MECs (maize expressed contigs).

It is possible to BLAST MAGIs, 454-ESTs, MECs, SAMIs and the 16,819 B73 maize BACs that as of 10/09/2009 have been at least partially sequenced by the maize genome sequencing project (DBI-0527192; Rick Wilson, PI). MAGIs have been annotated via sequence similarity, repeats, alignments to Sanger and 454 ESTs, and using an ab initio gene prediction tool. A repeat masker is available to facilitate primer design and annotation.

Our latest genetic map, ISU_IBM Map7, contains ~6,000 genic markers integrated with ~4,000 additional markers from other projects. This map has been fully integrated into the MAGI web site. It is possible to BLAST sequences against the ~6,000 sequence-defined, genic, genetic markers generated by the ISU maize mapping project.

Maize Full-Length cDNA Project

This project will span three years and involve two academic institutions: the University of Arizona and Stanford. The overall goal is to sequence 30,000 FLcDNA clones from two cDNA libraries of varied tissues and stress treatments. This project is using maize inbred B73 background for both clone libraries, the same inbred line being used for full genome sequencing. Specifically, the supporting aims of this project are:

  1. Sequence 5′ and 3′ ESTs from 130,000 random cDNA clones in library #1.
  2. Sequence 5′ and 3′ ESTs from 50,000 random cDNA clones in library #2.
  3. Select ~30,000 unique clones with both a 5′ and 3′ EST for full-length sequencing from libraries #1 and #2.
  4. Annotate the expression of these FLcDNAs using microarray hybridizations, locate FLcDNAs on the physical map of maize chromosomes, and display results using a web-based genome browser.
  5. Distribute clones and amplified FLcDNA libraries to the research community.
  6. Involve high school teachers and undergraduates in genomics projects and analysis; develop classroom exercises using maize genomics resources.


MaizeGDB is a community-oriented, long-term, federally funded informatics service to researchers focused on the crop plant and model organism Zea mays.

mips RE-dat

The MIPS plant genomics group focuses on the analysis of plant genomes, using bioinformatic techniques. To store and manage the data, we developed a database, PlantsDB, that aims to provide a data and information resource for individual plant species. In addition, PlantsDB provides a platform for integrative and comparative plant genome research. Currently, PlantsDB provides the following databases:

  • The Triticeae genome project
  • The maize genome database (MGSP)
  • The rice genome database (MOsDB)
  • The sorghum genome database
  • The brachypodium genome database
  • The MIPS Arabidopsis thaliana genome database
  • The Medicago truncatula genome database
  • The Lotus japonicus genome database
  • The tomato genome database
  • mips Repeat Element database (mips-REdat)
  • mips Repeat Element catalog (mips-REcat)
  • MotifDB
  • Plasmar

Oryza Map Alignment Project (OMAP)

The Golden path to unlocking the genetic potential of Wild Rice Species.

The long term goal of this project is to develop an experimentally tractable and closed model system to globally unravel and understand the evolution, physiology and biochemistry of the genus Oryza.

The specific objectives of this proposal are to:

  • Construct DNA fingerprint/BAC-end sequence physical maps from 11 deep coverage BAC libraries that represent the 11 wild genomes of Oryza (830,000 fingerprints; 1,659,000 BAC ends).
  • Align the 11 physical maps with the sequenced reference subspecies japonica and indica.
  • Construct high-resolution physical maps of rice chromosomes 1, 3 and 10 across the 11 wild genomes using a combination of hybridization and in silico anchoring strategies.
  • Provide convenient bioinformatics research and educational tools (FPC and web-based) to rapidly access and understand the collective Oryza genome.


Panzea is the bioinformatics arm of a project investigating the Genetic Architecture of Maize and Teosinte (NSF 0820619). The project is funded by the National Science Foundation.

The project is describing the genetic architecture of complex traits in maize and teosinte. We will identify genes that control domestication traits and three key agronomic traits: flowering time, plant height, and kernel quality. We will characterize allelic series at these genes, examine their epistatic and environmental interactions, and take a step toward the ultimate goal of predicting phenotype from genotype. The genetic, germplasm, and bioinformatic resources created by this project will help maize researchers worldwide to discover the genetic basis of any trait of interest.

The Panzea website provides access to the project database and bioinformatics module.

Rice Annotation Project Database

The Rice Annotation Project (RAP) was conceptualized upon the completion of the rice genome sequencing in 2004 with the aim of providing the scientific community with an accurate and timely annotation of the rice genome sequence. One of the major activities of RAP is to hold jamboree-style annotation meetings on a regular basis to facilitate the manual curation of all gene structures and functions in rice. Also part of the overall objective is to facilitate a comprehensive analysis of the sequence based on the results of annotation and the construction of a public database.


The Arabidopsis Information Resource (TAIR) collects information and maintains a database of genetic and molecular biology data for Arabidopsis thaliana, a widely used model plant. TAIR is located at the Carnegie Institution for Science Department of Plant Biology, Stanford, California.

Wheat SNP Database

The primary goal of the project is to discover and map single nucleotide polymorphisms in tetraploid and hexaploid wheat and to develop appropriate bioinformatic tools for public access to this resource. The secondary goal is to employ this resource in a preliminary characterization of the genetic structure of the gene pools of tetraploid and hexaploid wheat and of wheat's diploid ancestors.

The Wheat SNP Database is now available. This database includes conserved and genome-specific PCR primers for amplification of STSs from genomic DNA of wheat and its diploid and tetraploid ancestors, DNA sequences, sequence annotations, intron/exon predictions, electropherograms, haplotypes, SNPs, and positions of SNP markers on wheat linkage maps.

ACM Digital Library

Full text of every article ever published by the Association for Computing Machinery, along with bibliographic citations from major publishers in computing.

arXiv – Quantitative Biology

A document submission and retrieval system that is heavily used by the physics, mathematics and computer science communities. It has become the primary means of communicating cutting-edge manuscripts on current and ongoing research. Manuscripts are often submitted to the arXiv before they are published by more traditional means.