Biotechnology
| Poster #541 | |
| » | Abstract |
| » | Introduction |
| » | Materials and Methods |
| » | Conclusions |
| » | Download PDF |
![]() |
|
Hilary G. Morrison, Andrew G. McArthur, Julie E.J.
Nixon, Nora Q.E. Passamaneck, Ulandt Kim, Melissa K. Crocker, Gregory Hinkle,
Michael E. Holder, Rebecca Farr, Claudia I. Reich, Gary J. Olsen, Lorena
A. Fierro, Stephen B. Aley, Rodney D. Adam, Frances D. Gillin and Mitchell
L. Sogin
The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution
The Marine Biological Laboratory, Woods Hole, MA 02543
E-mail: morrison@mbl.edu, sogin@mbl.edu
Giardia lamblia genomic libraries were prepared in the plasmid vector pUC18, a 2.7 kb vector with primer sites for M13 forward (universal) and M13 reverse primers. We use a shotgun sequencing approach (Fig. 2).
The majority of single pass read data comes from one library, which has an average insert size of 2.35 kbp and a range of insert sizes from 1.6 to 2.7 kbp. Plasmid templates are prepared following a standard Qiagen REAL96 protocol on the BioRobot 9600. Template quality is checked by agarose gel electrophoresis. Each end of the cloned insert is sequenced using LI-COR's simultaneous bi-directional sequencing protocol (SBS, Roemer et al., 1997). The M13 universal (forward) primer is labeled with IRDye® 700 infrared dye and the M13 reverse primer with IRDye® 800 infrared dye. We use Epicentre's Excel II® cycle sequencing protocol. Sequencing reactions (7 µl volume) are assembled using the Tecan Miniprep 75. The samples are loaded onto 3.75% KBPlus™ polyacrylamide gels and run for 12 hours on a LI-COR 4200 sequencing machine.
A single SBS reaction generates a 900-1100 base read from each primer. Each read is given a unique and informative name (Fig. 3). We use LI-COR automatic base-calling software, then manually edit each sample. This results in single pass reads with high accuracy (>99%), which are made available to the scientific community (www.mbl.edu/Giardia and www.ncbi.nlm.nih.gov).
The data pipeline bypasses normal ABI-type PHRED base calling because of the superior performance of the LI-COR base caller when presented with LI-COR generated TIFF images.
Figure 2
Figure 3
After data acquisition, base calling and editing (A, B), SCF and phd (quality
value) files are generated on an OS/2 platform using the bulk SCF/phd executable "makeallscfphd," or
on a PC using eSeq™. Copies of the SCF and phd files are moved into
the assembly and analysis pipeline using a Unix shell script called "Harvest" and
all the original files associated with a sequencing run (text files, TIFFs,
sample files, etc.) are archived (C). The phd files are converted to fasta
format and trimmed of vector sequence using the PHRED routines "phd2seqfasta" and "crossmatch" (D).
As new data are harvested, the file of trimmed fasta format sequences is
converted to a BLAST-searchable database using the NCBI program "formatdb" (E).
Each fasta file is used as a query sequence to search the nucleotide and
protein sequence databases for homologies to known sequences using the
BLAST algorithm (Altschul et al., 1990). BLAST results are parsed and converted
to HTML format using a suite of UNIX scripts (PreviewBlast and WebRelease).
Included in these scripts are a number of error-checking steps to ensure,
for example, that all sequences contain unique identifiers. The BLASTX
results are the only "annotation" available for the first pass data (F).
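The phd-to-fasta conversion and the unique-identifier check described above can be illustrated with a minimal Python sketch. This is not the project's "phd2seqfasta" routine or its PreviewBlast/WebRelease scripts. It assumes the standard phd text layout (a BEGIN_SEQUENCE line followed by a BEGIN_DNA ... END_DNA block of "base quality position" lines), and the function names `parse_phd` and `to_fasta` and the read name in the usage note are hypothetical.

```python
def parse_phd(text):
    """Return (name, bases, quals) from the text of one phd file.

    Assumes the standard phd layout: a BEGIN_SEQUENCE line carrying the
    read name, then 'base quality position' lines between BEGIN_DNA and
    END_DNA.
    """
    name, bases, quals, in_dna = None, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split()[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()
            bases.append(fields[0])
            quals.append(int(fields[1]))
    if name is None:
        raise ValueError("no BEGIN_SEQUENCE line found")
    return name, "".join(bases), quals


def to_fasta(records, width=60):
    """Format (name, bases, quals) records as FASTA text.

    Raises ValueError on a repeated identifier, mirroring the kind of
    error check the text mentions (hypothetical implementation).
    """
    seen = set()
    out = []
    for name, bases, _quals in records:
        if name in seen:
            raise ValueError(f"duplicate identifier: {name}")
        seen.add(name)
        out.append(f">{name}")
        out.extend(bases[i:i + width] for i in range(0, len(bases), width))
    return "\n".join(out) + "\n"
```

For example, `to_fasta([parse_phd(open(path).read()) for path in phd_paths])` would build one merged fasta file from a list of phd files, where `phd_paths` is a placeholder for the harvested file names. Vector trimming is not shown.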
The primary reads are assembled into sequence contigs using PHRAP and CONSED (Gordon et al., 1998). As described previously, phd files are trimmed of contaminating vector sequence and converted to fasta format. The merged fasta sequence file and associated quality file are used by PHRAP to generate contigs of overlapping reads, with parameters such as mismatch and minimum overlap set by the user (D). Currently, we set the overlap to 50 bases and allow no more than 2% mismatch. Contig sequences are used to construct a BLAST searchable database (E) and to identify target clones for sequence closure.
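To make the two thresholds concrete (a 50-base overlap and no more than 2% mismatch), the following Python sketch tests whether a candidate overlap between two reads meets both. It is an illustration only and not PHRAP's algorithm: PHRAP aligns reads with gaps and weighs quality values, whereas this sketch assumes an ungapped overlap at a known offset. The function name `overlap_passes` and its parameters are hypothetical.

```python
def overlap_passes(read_a, read_b, offset, min_overlap=50, max_mismatch=0.02):
    """Check an ungapped suffix/prefix overlap against two thresholds.

    read_b is assumed to begin at position `offset` of read_a. The default
    thresholds are the values quoted in the text; everything else here is
    an illustrative assumption.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    tail = read_a[offset:]
    length = min(len(tail), len(read_b))
    if length < min_overlap:
        return False
    mismatches = sum(1 for x, y in zip(tail[:length], read_b[:length]) if x != y)
    return mismatches / length <= max_mismatch
```

With these defaults, a 60-base overlap tolerates one mismatch (about 1.7%) but not two (about 3.3%).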
Figure 4B
CRITICA ("Coding Region Identification Tool Invoking Comparative Analysis"; Badger and Olsen, 1999) is used to identify likely protein coding sequences in each contig. In the comparative analysis component of CRITICA, regions of DNA are aligned with related sequences from public databases, and greater than expected amino acid identity indicates a likely coding sequence. Proteins identified by CRITICA are entered into the ERGO database at Integrated Genomics (www.integratedgenomics.com) for functional analysis. In this process, a model of connected metabolic pathways is constructed. Using a model-based approach, rather than looking at isolated coding regions, means that functional annotations are more reliable and "missing" functions are more readily recognized. The initial static model is refined, using a web-based interface, by curators at Integrated Genomics, the MBL, and other institutions.
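The idea behind the comparative component can be sketched in Python: when two related DNA regions are compared, the true reading frame tends to show higher amino acid identity than the nucleotide identity would suggest, because many nucleotide differences are synonymous. The sketch below is not CRITICA's statistic. It assumes the two sequences are already aligned, ungapped and of equal length, it uses the standard genetic code, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate DNA in the given forward frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def frame_identities(query, subject):
    """Return nucleotide identity and amino acid identity per forward frame.

    Assumes query and subject are pre-aligned without gaps (a simplifying
    assumption made for this sketch).
    """
    nt = identity(query.upper(), subject.upper())
    aa = [identity(translate(query, f), translate(subject, f)) for f in range(3)]
    return nt, aa
```

A frame whose amino acid identity clearly exceeds the nucleotide identity is a candidate coding frame under this simplified view. A real analysis would also need gapped alignments, reverse-strand frames and a proper expectation model.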
Figure 4C
Figure 4D
Figure 4E
Figure 4F