Finding Genes in the Human Genome
To study genes for a wide variety of research, diagnostic, or therapeutic purposes, scientists use computer programs that analyze DNA sequences. These programs indicate where pieces of genes are located within what is frequently a vast and complex genetic landscape. Although conventional programs detect many parts of genes with ease, they fail when it comes to detecting two important elements: the very first pieces of genes, and the nearby "on" switches of genes, called promoters.
Researchers in the bioinformatics group at Cold Spring Harbor Laboratory have now developed a computer program that is especially good at finding these first segments and "on" switches of genes. The program is tailored toward detecting these features in the human genome sequence, but it will also be useful for annotating other mammalian genomes.
The program, called "First Exon Finder" or "FirstEF," was developed by Michael Zhang and his colleagues. A paper describing the program is published in the December issue of Nature Genetics.
"FirstEF is the first program that can readily and accurately detect a class of gene segments that has previously been extraordinarily difficult to find," says Zhang. "It's like looking for buried treasure."The gene segments Zhang is referring to occur at the very beginning of genes, and are called "non-coding first exons." Because they do not encode protein segments, non-coding first exons are undetectable by conventional computer programs that rely on protein coding patterns found in DNA.
Instead, FirstEF recognizes five other DNA "signatures" that betray the presence and location of first exons in genes. The biological basis of some of these telltale genetic signatures is unknown, says Zhang. "But they are real, and perhaps someday biology will explain why they are there." One such signature is the frequency with which two building blocks of DNA, C and G, occur next to each other.
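To make the C-next-to-G signature concrete, the short Python sketch below counts how often a C is immediately followed by a G in a DNA string and compares that count with what the individual C and G counts would predict. It is only an illustration of the general idea, not FirstEF's actual method: the function name cpg_stats, the example sequence, and the use of the common observed/expected ratio are all assumptions introduced here.

```python
def cpg_stats(seq: str) -> dict:
    """Summarize how often C is immediately followed by G in a DNA string.

    Illustrative only: the function name and the returned fields are
    assumptions for this sketch, not part of FirstEF.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    cpg_count = seq.count("CG")

    # Fraction of adjacent base pairs that are C followed by G.
    cpg_freq = cpg_count / (n - 1) if n > 1 else 0.0

    # Observed/expected ratio: observed CG count divided by the count
    # expected if C and G were placed independently (C * G / length).
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0

    return {
        "length": n,
        "cpg_count": cpg_count,
        "cpg_freq": cpg_freq,
        "obs_exp": obs_exp,
    }


if __name__ == "__main__":
    # Made-up example sequence, not real genomic data.
    print(cpg_stats("ACGTCGCG"))
```

A gene finder would typically compute a score like this over a sliding window along a chromosome and combine it with other signals; how FirstEF weighs its five signatures is not described here.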
Despite the fact that they do not encode protein, non-coding first exons are essential components of gene structure and function. Consequently, the ability to detect non-coding first exons is crucial for scientists wishing to study genes for a wide variety of biological and biomedical applications.
"The results Michael Zhang is getting with FirstEF are very exciting," says James Kent, a graduate student at the University of California at Santa Cruz. Kent's own computer program called "GigAssembler" caused a sensation in the world of genome research when he used it to generate the first and only publicly-available assembly of the human genome sequence in June of last year. Kent hopes to add a FirstEF "track" to the Human Genome Browser he has created (available at http://genome.ucsc.edu).
When Zhang used FirstEF to analyze the DNA sequences of human chromosomes 21 and 22, he found that the program correctly pinpointed the location of 90 percent of known first exons on those chromosomes. According to Zhang, FirstEF was nearly twice as sensitive as a program available from DoubleTwist, Inc. and Genomatix Software GmbH called "PromoterInspector." Zhang was joined in this study by postdoctoral researchers Ramana Davuluri (now on the faculty at Ohio State University) and Ivo Grosse.
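The 90 percent figure is a sensitivity: the share of known first exons that the program finds. The Python sketch below shows one way such a number could be computed from two lists of coordinates. The matching rule (any overlap counts as a hit), the function name sensitivity, and the toy coordinates are assumptions for illustration; the criteria actually used in the study are not given here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def sensitivity(known: List[Interval], predicted: List[Interval]) -> float:
    """Fraction of known intervals overlapped by at least one prediction.

    Assumption for this sketch: any overlap counts as a correct detection.
    """
    if not known:
        raise ValueError("need at least one known interval")
    hits = sum(1 for k in known if any(overlaps(k, p) for p in predicted))
    return hits / len(known)


if __name__ == "__main__":
    # Toy coordinates, invented purely for illustration.
    known_exons = [(100, 200), (500, 600), (900, 950), (1200, 1300)]
    predictions = [(150, 250), (580, 700), (2000, 2100)]
    print(sensitivity(known_exons, predictions))
```

Sensitivity alone says nothing about false alarms; a fuller evaluation would also report how many predictions match no known first exon.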
Later, Zhang and his colleagues used FirstEF to analyze the entire human genome. They identified some 68,000 first exons. This result does not necessarily mean that there are 68,000 or so human genes, because a single gene can use alternative first exons. Moreover, the total number of genes in an organism's genome depends on other, subtle definitions of what constitutes a gene. Nevertheless, Zhang believes there are 50,000 to 60,000 human genes and that previous estimates of 30,000 to 40,000 human genes are too low.
One bonus of the way FirstEF operates is that it identifies not only first exons of genes, but also the "on" switches of genes called "promoters."
"A significant bottleneck in current DNA research is finding the promoters of genes. Because gene promoters and first exons are related, FirstEF kills two birds with one stone," says Zhang.
Supplemental Information: Cold Spring Harbor Laboratory is a private, non-profit basic research and educational institution. Under the leadership of Dr. Bruce Stillman, a member of the National Academy of Sciences and a Fellow of the Royal Society (London), some 300 scientists conduct research in cancer, neurobiology, plant genetics and bioinformatics. The laboratory is recognized internationally for its educational activities, which include an extensive program of scientific meetings and courses that attract more than 8000 scientists to the campus each year. In addition, the laboratory trains college students through its Undergraduate Research Program, operates the DNA Learning Center in Cold Spring Harbor, and runs the Partners for the Future program for high school seniors. Other components of the laboratory include the Watson School of Biological Sciences, named for laboratory president and Nobel laureate James D. Watson; the Cold Spring Harbor Laboratory Press; and the Banbury Conference Center. For more information, visit the Laboratory's web site at www.cshl.edu or call the Department of Public Affairs at (516) 367-8455.