Analysis of data from eight diverse cancer cell lines shows that Roche Nimblegen and 454 exome sequencing technologies can be successfully applied to identify variations in gene-coding regions. From sequencing data with an average of 7.3-fold coverage, variants relative to the NCBI36 reference genome were identified in about 8% of all target regions on the exome capture array. While the majority of these variants could be confirmed in the dbSNP database, on average 0.16% of total target regions carry a novel variant. A comparison of SNP genotype calls from exome sequencing with data generated on the Affymetrix Genome-Wide Human SNP Array 6.0 showed high concordance between the two technology platforms. The concordance is 97% for homozygous sites and ranges from 30% to 90% at heterozygous positions, with accuracy dependent on sequencing read depth. Our analysis of the relationship between read depth and power of detection suggested that a minimum of ten-fold read depth is required for reliably detecting both alleles at heterozygous sites. These results provide guidance in planning future genome sequencing projects. For the seven examined cell lines that are also present in the COSMIC database, we show that 19 of 21 known mutations can be re-discovered by exome sequencing. Two previously described mutations were missed due to lack of sequence coverage. In one case this was due to incomplete coverage of the human exome in the Nimblegen 2.1M capture array, indicating a need for improvements in array design. By successful re-identification of the EGFR amplification and the SMAD4 homozygous deletion in the MDA-MB468 cell line, we demonstrate that copy number alterations can be inferred from the sequencing read depth data. However, because of the stochastic nature of sequencing read depth and likely unevenness in the exome capturing process, in general it is not possible to reliably estimate copy-number information from our data.
Applying the technology to more samples would help improve our ability to estimate and correct for systematic biases in the platform, and increasing the depth of sequencing reads would reduce the variance due to random fluctuation in read number.
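
The dependence of heterozygous detection on read depth can be illustrated with a simple binomial sketch. This is not the analysis used in the study. It assumes that each read samples one of the two alleles independently with probability 0.5, and that an allele counts as detected once at least `min_reads` reads support it. The function name `prob_both_alleles_seen` and the default threshold of two reads are hypothetical choices made for illustration.

```python
from math import comb


def prob_both_alleles_seen(depth: int, min_reads: int = 2,
                           allele_fraction: float = 0.5) -> float:
    """Probability that BOTH alleles at a heterozygous site are each
    supported by at least `min_reads` reads, given `depth` total reads.

    Assumption (not from the source): the number of reads carrying one
    allele follows Binomial(depth, allele_fraction).
    """
    if depth < 2 * min_reads:
        # Too few reads for both alleles to reach the threshold.
        return 0.0
    total = 0.0
    # k = reads supporting the first allele; depth - k support the second.
    for k in range(min_reads, depth - min_reads + 1):
        total += (comb(depth, k)
                  * allele_fraction ** k
                  * (1.0 - allele_fraction) ** (depth - k))
    return total


if __name__ == "__main__":
    # Print the modelled detection probability for a few read depths.
    for depth in (4, 7, 10, 20):
        print(depth, round(prob_both_alleles_seen(depth), 4))
```

Under these assumptions the probability of observing both alleles rises steeply with depth, which is in line with the observation that low-coverage heterozygous sites are called less accurately. Real data deviate from this model through capture bias and sequencing errors, so the numbers are indicative only.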