ChIP-seq
MBD-seq (MIRA-seq)
BS-seq
RNA-seq
miRNA-seq
ChIP-Seq is a new frontier technology to analyze in
vivo protein-DNA interactions.
ChIP-Seq
◦ Combination of chromatin immunoprecipitation
(ChIP) with ultra high-throughput massively
parallel sequencing
◦ Allows mapping of protein–DNA interactions in vivo
on a genome scale
Workflow of
ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
Current microarray and ChIP-chip designs require
prior knowledge of the sequence of interest, such as a
promoter, enhancer, or RNA-coding domain.
Lower cost
Higher resolution
Higher accuracy
Alterations in transcription-factor binding in
response to environmental stimuli can be evaluated
for the entire genome in a single experiment.
Solexa (Illumina)
◦ 1 GB of sequences in a single run
◦ 35 bases in length
454 Life Sciences (Roche Diagnostics)
◦ 25-50 MB of sequences in a single run
◦ Up to 500 bases in length
SOLiD (Applied Biosystems)
◦ 6 GB of sequences in a single run
◦ 35 bases in length
8 lanes
100 tiles per lane
Quality Scores
Sequence Files
10-40 million reads
per lane
~500 MB files
Quality scores describe the confidence of bases in each read
Solexa pipeline assigns a quality score to the four possible
nucleotides for each sequenced base
9 million sequences (500 MB file) produce a ~6.5 GB quality score file
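To make the notion of a quality score concrete, the sketch below converts a score into an estimated base-call error probability. It assumes the two commonly used definitions, which the slides do not spell out: the Phred scale, Q = -10 log10(p), and the older Solexa scale, Q = -10 log10(p / (1 - p)). The function names are illustrative only.

```python
def phred_to_error_prob(q: float) -> float:
    """Phred scale: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q: float) -> float:
    """Solexa scale: Q = -10 * log10(p / (1 - p)), so p = 1 / (1 + 10^(Q/10))."""
    return 1.0 / (1.0 + 10 ** (q / 10))


# Compare the two scales for a few scores; they converge as Q grows.
for q in (0, 10, 20, 30):
    print(q, phred_to_error_prob(q), solexa_to_error_prob(q))
```

On both scales a higher score means a more confident base call; the two differ mainly at low scores.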
Rapid mapping of these short sequence reads to
the reference genome
Visualize mapping results
◦ Thousands of enriched regions
Peak analysis
◦ Peak detection
◦ Finding exact binding sites
Compare results of different experiments
◦ Normalization
◦ Statistical tests
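As a rough illustration of the normalization and comparison steps listed above, the sketch below bins read start positions into fixed windows, scales the counts to reads per million, and flags windows where the ChIP sample exceeds the control by a fold threshold. The window size, fold threshold, and pseudocount are arbitrary assumptions, and a plain fold cutoff stands in for a proper statistical test.

```python
from collections import Counter


def window_counts(read_starts, window):
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(start // window for start in read_starts)


def reads_per_million(counts, total_reads):
    """Scale raw window counts so samples of different depth are comparable."""
    return {w: c * 1_000_000 / total_reads for w, c in counts.items()}


def enriched_windows(chip_rpm, control_rpm, min_fold=4.0, pseudocount=1.0):
    """Return window indices where ChIP signal is >= min_fold times control."""
    return sorted(
        w for w, value in chip_rpm.items()
        if (value + pseudocount) / (control_rpm.get(w, 0.0) + pseudocount) >= min_fold
    )


# Toy data: read start coordinates on one chromosome (hypothetical values).
chip_starts = [105, 110, 120, 130, 150, 900]
control_starts = [100, 910]
chip = reads_per_million(window_counts(chip_starts, 100), len(chip_starts))
control = reads_per_million(window_counts(control_starts, 100), len(control_starts))
print(enriched_windows(chip, control, min_fold=1.5))
```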
Mapping Methods
◦ Need to allow mismatches and gaps
SNP locations
Sequencing errors
Reading errors
◦ Indexing and hashing
genome
oligonucleotide reads
Use of quality scores
Use of SNP knowledge
Performance
◦ Partitioning the genome or sequence reads
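The indexing-and-hashing idea can be sketched with a plain dictionary that maps every k-mer of the genome to the positions where it occurs. This is a toy illustration on a short made-up sequence, not the data structure of any particular mapper; the names and the choice of k are assumptions.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict:
    """Hash every k-mer in the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def lookup(index: dict, read: str, k: int) -> list:
    """Candidate start positions for a read, using its first k bases as the key."""
    return index.get(read[:k], [])


genome = "ACGTACGTTAGCACGTTA"
index = build_kmer_index(genome, k=4)
read = "ACGTTA"
candidates = lookup(index, read, 4)
# Keep only the candidates where the whole read matches exactly.
exact = [p for p in candidates if genome[p:p + len(read)] == read]
print(candidates, exact)
```

Hashing the reads instead of the genome works the same way with the roles swapped, which is one way to keep memory use bounded when the genome is large.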
Fast sequence similarity search algorithms (like
BLAST)
◦ Not specifically designed for mapping millions of
query sequences
◦ Take a very long time
e.g. 2 days to map half a million sequences to a
70MB reference genome (using BLAST)
◦ Indexing the genome is memory expensive
Both reads and reference genome are converted to
numeric data type using 2-bits-per-base coding
Load reference genome into memory
◦ For human genome, 14GB RAM required for storing
reference sequences and index tables
300 (gapped) to 1200 (ungapped) times faster than BLAST
2 mismatches or 1-3bp continuous gap
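A minimal sketch of 2-bits-per-base coding and why it makes mismatch counting cheap: once two sequences are integers, a single XOR marks every differing base. The particular code assignment (A=00, C=01, G=10, T=11) and the function names are assumptions for illustration, and ambiguous bases such as N are not handled.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(x: int, y: int, length: int) -> int:
    """Count differing bases between two encoded sequences of equal length (>= 1)."""
    diff = x ^ y                              # nonzero bit pair = mismatching base
    low_bit_mask = int("01" * length, 2)      # selects the low bit of each base
    per_base = (diff | (diff >> 1)) & low_bit_mask
    return bin(per_base).count("1")


read = encode("ACGTACGT")
ref = encode("ACGTTCGA")
print(count_mismatches(read, ref, 8))
```

With this coding a 32-base sequence fits in a single 64-bit word.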
Errors accumulate during the sequencing process
◦ Much higher number of sequencing errors at the
3’-end (sometimes making the reads unalignable to
the reference genome)
◦ Iteratively trim several basepairs at the 3’-end and
redo the alignment
◦ Improve sensitivity
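The trim-and-realign loop can be sketched as follows. The aligner here is a naive ungapped scan written only for this example, and the trimming step, minimum length, and toy sequences are all assumptions; a real pipeline would call its own mapper at each iteration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, genome: str, max_mismatches: int) -> list:
    """Naive ungapped mapping: every start position within the mismatch budget."""
    n = len(read)
    return [p for p in range(len(genome) - n + 1)
            if hamming(read, genome[p:p + n]) <= max_mismatches]


def map_with_trimming(read, genome, min_len=8, step=2, max_mismatches=0):
    """Trim `step` bases from the 3' end until the read maps or gets too short."""
    current = read
    while len(current) >= min_len:
        hits = map_read(current, genome, max_mismatches)
        if hits:
            return current, hits
        current = current[:-step]  # drop the error-prone 3' bases and retry
    return None, []


genome = "GATTACAGGCTCCATGCAAGTCTTAGGACC"
read = "ACAGGCTCCATGGTTC"  # last four bases simulate 3'-end sequencing errors
print(map_with_trimming(read, genome))
```

The demo uses exact matching so the result is easy to follow by hand: the full read fails to map, and the read trimmed to 12 bases maps at position 4.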
ELAND (Cox, unpublished)
◦ “Efficient Large-Scale Alignment of Nucleotide
Databases” (Solexa Ltd.)
SeqMap (Jiang, 2008)
◦ “Mapping massive amount of oligonucleotides
to the genome”
RMAP (Smith, 2008)
◦ “Using quality scores and longer reads
improves accuracy of Solexa read mapping”
MAQ (Li, 2008)
◦ “Mapping short DNA sequencing reads and
calling variants using mapping quality scores”
Partition reads into 4 seeds {A,B,C,D}
◦ At least 2 seeds must map with no mismatches
Scan genome to identify locations where the seeds match
exactly
◦ 6 possible combinations of the seeds to search
{AB, CD, AC, BD, AD, BC}
◦ 6 scans to find all candidates
Do approximate matching around the exactly-matching
seeds.
◦ Determine all targets for the reads
◦ Ins/del can be incorporated
The reads are indexed and hashed before scanning
genome
Bit operations are used to accelerate mapping
◦ Each nt encoded into 2-bits
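The seed strategy rests on a pigeonhole argument: if a read has at most 2 mismatches and is split into 4 seeds, at least 2 seeds contain no mismatch, so one of the 6 seed pairs must match exactly. The sketch below illustrates that idea on plain strings; it is not ELAND's implementation, it uses `str.find` instead of a hashed read index, it handles no insertions or deletions, and all names are assumptions.

```python
from itertools import combinations


def split_into_seeds(read: str, n_seeds: int = 4) -> list:
    """Split a read into n_seeds consecutive (offset, seed) pieces."""
    size = len(read) // n_seeds
    bounds = [i * size for i in range(n_seeds)] + [len(read)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]]) for i in range(n_seeds)]


def map_by_seed_pairs(read: str, genome: str, max_mismatches: int = 2) -> list:
    """Find read positions supported by an exactly matching seed pair, then verify."""
    candidates = set()
    for (off_a, seed_a), (off_b, seed_b) in combinations(split_into_seeds(read), 2):
        start = genome.find(seed_a)
        while start != -1:
            read_start = start - off_a
            # The second seed must match exactly at its expected offset.
            if read_start >= 0 and genome.startswith(seed_b, read_start + off_b):
                candidates.add(read_start)
            start = genome.find(seed_a, start + 1)
    hits = []
    for p in sorted(candidates):
        window = genome[p:p + len(read)]
        if len(window) == len(read) and \
                sum(a != b for a, b in zip(read, window)) <= max_mismatches:
            hits.append(p)
    return hits


genome = "GATTACAGGCTCCATGCAAGTCTTAGGACC"
read = "TCAGGCTCCTTGCAAG"  # genome[4:20] with mismatches in the 1st and 3rd seed
print(map_by_seed_pairs(read, genome))
```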
Commercial sequence mapping program that comes with
the Solexa machine
Allow at most 2 mismatches
Map sequences up to 32 nt in length
All sequences have to be same length
Improve mapping accuracy
◦ Possible sequencing errors at 3’-ends of longer
reads
◦ Base-call quality scores
Use of base-call quality scores
◦ Quality cutoff
High quality positions are checked for
mismatches
Low quality positions always induce a match
◦ Quality control step eliminates reads with too
many low quality positions
Allow any number of mismatches
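The quality-cutoff rule described above can be sketched in a few lines: positions below the cutoff are treated as wildcards, and reads with too many such positions are discarded up front. This is an illustration of the rule as stated on the slide, not RMAP's actual code; the cutoff values and names are assumptions.

```python
def passes_quality_control(quals, cutoff, max_low_quality):
    """Reject reads that have too many positions below the quality cutoff."""
    return sum(q < cutoff for q in quals) <= max_low_quality


def quality_aware_mismatches(read, quals, reference, cutoff):
    """Count mismatches at high-quality positions only.

    Low-quality positions always count as a match, whatever the base is.
    """
    return sum(
        1 for base, q, ref_base in zip(read, quals, reference)
        if q >= cutoff and base != ref_base
    )


read = "ACGTAC"
quals = [30, 30, 30, 5, 30, 30]
reference = "ACGAAT"
if passes_quality_control(quals, cutoff=10, max_low_quality=1):
    print(quality_aware_mismatches(read, quals, reference, cutoff=10))
```

In the toy read, the mismatch at the fourth base is ignored because its score is below the cutoff, so only the final base counts.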
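Figure: read-mapping summary (12 M reads; steps: quality filter, map to reference genome)
◦ Mapped to a unique location: 7.2 M
◦ Mapped to multiple locations: 1.8 M
◦ No mapping: 2.5 M
◦ Low quality: 0.5 M
◦ 3M (association of this label is not recoverable from the figure)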
BED files are built to
summarize mapping
results
BED files can be easily
visualized in Genome
Browser
http://genome.ucsc.edu
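A minimal sketch of turning mapped reads into BED lines. BED uses tab-separated fields with a 0-based start and an exclusive end; the six fields written here are chrom, chromStart, chromEnd, name, score, and strand. The read records and the helper name are made up for the example.

```python
def to_bed_line(chrom, start, read_length, name, strand, score=0):
    """Format one mapped read as a 6-column BED line (0-based, end-exclusive)."""
    end = start + read_length
    return f"{chrom}\t{start}\t{end}\t{name}\t{score}\t{strand}"


# Hypothetical mapping results: (chromosome, 0-based start, read length, id, strand)
mapped_reads = [
    ("chr1", 100, 35, "read1", "+"),
    ("chr1", 250, 35, "read2", "-"),
]
bed_lines = [to_bed_line(*rec) for rec in mapped_reads]
print("\n".join(bed_lines))
```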
Robertson, G. et al. Nat. Methods 4, 651-657 (2007)
300 kb region from mouse ES cells
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
Frietze et al JBC 2010
SISSRs (Site Identification from Short Sequence Reads): Jothi
et al. NAR, 2008.
MACS (Model-based Analysis of ChIP-Seq): Zhang et al,
Genome Biology, 2008.
QuEST (Genome-wide analysis of transcription factor binding
sites based on ChIP–seq data): Valouev, A. et al. Nature
Methods, 2008.
PeakSeq (PeakSeq enables systematic scoring of ChIP–seq
experiments relative to controls): Rozowsky, J. et al. Nature
Biotech. 2009.
FindPeaks (FindPeaks 3.1: a tool for identifying areas of
enrichment from massively parallel short-read sequencing
technology.): Fejes, A.P. et al. Bioinformatics, 2008.
Hpeak (An HMM-based algorithm for defining read-enriched
regions from massive parallel sequencing data): Xu et al,
Bioinformatics, 2008.
Methyl-CpG binding domain-based capture (MBDCap)
technology is used to capture methylation
sites. Double-stranded methylated DNA fragments
can be detected, and the method is sensitive to different
methylation densities.
Genome-wide sequencing technology was used to
get the sequence of each short fragment.
The sequenced reads were mapped to the human
genome to find their locations.
Lan et al Unpublished
BS-seq: genomic DNA is treated with sodium
bisulphite (BS) to convert cytosine, but not
methylcytosine, to uracil, followed by high-throughput sequencing.
Truly single-base resolution
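The logic of bisulphite conversion can be simulated in silico. Uracil is read as T after amplification and sequencing, so an unmethylated C appears as T in the read while a methylated C stays C. The sketch below handles the forward strand only and assumes an error-free, already aligned read; the function names are illustrative.

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Simulate conversion: unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference: str, read: str, offset: int = 0) -> dict:
    """For each reference C covered by the read, report True if methylated."""
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos] == "C":
            if base == "C":
                calls[pos] = True    # protected from conversion
            elif base == "T":
                calls[pos] = False   # converted, hence unmethylated
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1, 5})
print(read, call_methylation(reference, read))
```

Because each cytosine is called individually from the read, the resolution is a single base.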
RNA-Seq is a new approach to transcriptome
profiling that uses deep-sequencing technologies.
Studies using this method have already altered our
view of the extent and complexity of eukaryotic
transcriptomes. RNA-Seq also provides a far more
precise measurement of levels of transcripts and
their isoforms than other methods.
RNA-seq protocol
Single base resolution
High throughput
Low background noise
Ability to distinguish different isoforms and allelic
expression
Relatively low cost