Probe Selection Problems in Gene Sequences

DNA Microarrays


cDNA arrays: probes are PCR products amplified from clones
Oligonucleotide arrays: specific probes are designed from sequence
(C) 2003, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Gene Detection Using Microarrays
….AGCTCTAGGCCCAT….
….CCCATGGACTCAAG….
….CCCTGGCGACAGTT….
….AATCCTACGACGGC….

Each probe
 Is selected from the gene sequences
 Has the same length (20-mer or longer)
 Detects one gene
Probe Selection

Uniqueness
 Comparing potential probe sequences with the full-length
sequences of other genes being monitored

Hybridization characteristics
 Tm (melting temperature) among probes
 GC content
 Secondary structure
 Position of core region
 Number of mismatches
Affymetrix Probe Selection Criteria

Single base (As, Ts, Cs, Gs)
 No single base may exceed 50% of the probe length

Length of a contiguous region of As and Ts, or of Cs and Gs
 Less than 25% of the probe length

(G+C)%
 Between 40% and 60% of the probe sequence
 The (G+C)% range can be adjusted based on the G+C content of the genome sequence

Contiguous repeat
 No 15-base repeat anywhere in the entire coding sequence of the whole genome

No self-complementarity within the probe sequence
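
The criteria above can be sketched as a simple filter. This is a minimal illustration, not Affymetrix's actual implementation: the function names, the stem length of 4 used for the self-complementarity test, and the way that test is defined (a short subsequence whose reverse complement occurs later in the probe) are all assumptions. The genome-wide 15-base repeat check is omitted because it needs the full coding sequence.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def longest_run(seq, alphabet):
    """Length of the longest contiguous stretch made only of letters in `alphabet`."""
    runs = re.findall(f"[{alphabet}]+", seq)
    return max((len(r) for r in runs), default=0)

def is_self_complementary(seq, stem=4):
    """Assumed test: some `stem`-mer has its reverse complement further along the probe."""
    for i in range(len(seq) - stem + 1):
        rc = reverse_complement(seq[i:i + stem])
        if seq.find(rc, i + stem) != -1:
            return True
    return False

def passes_criteria(probe, gc_min=0.40, gc_max=0.60, stem=4):
    n = len(probe)
    # Single base must not exceed 50% of the probe length.
    if max(probe.count(b) for b in "ACGT") > 0.5 * n:
        return False
    # A/T-only and C/G-only stretches must be shorter than 25% of the probe.
    if longest_run(probe, "AT") >= 0.25 * n or longest_run(probe, "CG") >= 0.25 * n:
        return False
    # (G+C)% must lie in the configured range (40-60% by default).
    gc = (probe.count("G") + probe.count("C")) / n
    if not gc_min <= gc <= gc_max:
        return False
    return not is_self_complementary(probe, stem)
```

The `gc_min` and `gc_max` parameters reflect the note that the (G+C)% range may be shifted for genomes with unusual base composition.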
Tm Calculation

Approximate target DNA concentration

 ΔH : enthalpy for helix formation
 ΔS : entropy for helix formation
 R : molar gas constant
 c : total molar concentration of the annealing oligonucleotides, when the oligonucleotides are not self-complementary
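
The definitions above match the standard two-state expression Tm = ΔH / (ΔS + R ln(c/4)), with the result in kelvin. Assuming that is the intended formula, a minimal sketch follows. The units (ΔH in cal/mol, ΔS in cal/(K·mol), R = 1.987 cal/(K·mol)) and the conversion to degrees Celsius are assumptions, as is the function name.

```python
import math

# Molar gas constant in cal/(K*mol); the choice of calorie units is an assumption.
R_CAL = 1.987

def melting_temperature(delta_h, delta_s, conc):
    """Tm in degrees Celsius for a non-self-complementary duplex.

    delta_h: enthalpy of helix formation, cal/mol (negative)
    delta_s: entropy of helix formation, cal/(K*mol) (negative)
    conc:    total molar concentration of the annealing oligonucleotides, mol/L
    """
    tm_kelvin = delta_h / (delta_s + R_CAL * math.log(conc / 4.0))
    return tm_kelvin - 273.15
```

For example, `melting_temperature(-150000.0, -400.0, 1e-6)` gives roughly 75.5 degrees Celsius. In practice ΔH and ΔS would come from nearest-neighbor parameter tables, which are not part of this sketch.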
General Approach

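Probe sequence generation
 Generate candidate probes by constructing arbitrary sequences.
 Generate candidate probes from the target gene sequences.
 In general, candidates are generated with a sliding window starting from the beginning of the gene.
ACGCGTCGCGAGGCCTAGGCC…
Candidate probe sequences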
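
Sliding-window generation can be sketched as follows. The function name and the `step` parameter are illustrative assumptions; the slides only state that the window starts at the beginning of the gene.

```python
def candidate_probes(gene, length, step=1):
    """Yield (start, subsequence) for every window of `length` bases in `gene`."""
    for start in range(0, len(gene) - length + 1, step):
        yield start, gene[start:start + length]

# Toy usage on a short fragment with 4-base windows.
windows = list(candidate_probes("ACGCGT", 4))
```

Here `windows` is `[(0, "ACGC"), (1, "CGCG"), (2, "GCGT")]`. Keeping the start position alongside each candidate makes it easy to map a chosen probe back to its location in the gene.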
General Approach



Remove candidate probe sequences that commonly occur in genes other than the target sequences (using a sequence alignment program such as BLAST). This reduces the chance of cross-hybridization.
Filter the candidate probe sequences by Tm so that the target and probe hybridize well.
Filter by the secondary structure of the sequence: remove candidates that can form intramolecular structures.
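
The first filtering step can be illustrated with an exact-substring check. This is a deliberately simplified stand-in for a BLAST search, which would also catch near matches; the function name and the toy fragments are assumptions made for illustration.

```python
def remove_cross_hybridizing(candidates, other_genes):
    """Keep only candidates that do not occur verbatim in any non-target gene."""
    return [p for p in candidates if not any(p in g for g in other_genes)]

# Toy usage: 5-base candidates from one fragment, screened against another.
target = "AGCTCTAGGCCCAT"
others = ["CCCATGGACTCAAG"]
candidates = [target[i:i + 5] for i in range(len(target) - 5 + 1)]
unique = remove_cross_hybridizing(candidates, others)
```

In this toy case only "CCCAT" is dropped, because it also occurs in the second fragment. The Tm and secondary-structure filters would then be applied to the surviving candidates.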
GA Representation


A chromosome represents a possible probe set.
Each position of a chromosome holds the start position of a probe sequence within the corresponding gene sequence.
Probe set (size n: number of target genes): 3, 2, 15, 90, .....
Each entry is the start position of the selected region, i.e. one probe (parameter: probe length c)
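
A minimal sketch of this encoding, assuming 0-based start positions and one entry per target gene. The function names are hypothetical.

```python
import random

def random_chromosome(genes, probe_len, rng):
    """One random start position per gene, chosen so the probe fits inside the gene."""
    return [rng.randrange(len(g) - probe_len + 1) for g in genes]

def decode(chromosome, genes, probe_len):
    """Turn start positions back into the probe sequences they select."""
    return [g[s:s + probe_len] for s, g in zip(chromosome, genes)]

# Toy usage with two short fragments and probe length c = 5.
genes = ["AGCTCTAGGCCCAT", "CCCATGGACTCAAG"]
probes = decode([3, 2], genes, 5)
```

Here `probes` is `["TCTAG", "CATGG"]`. Storing only start positions keeps the chromosome a fixed-length integer vector, which suits standard crossover and mutation operators.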
GA Operators

Fitness function
 Linear combination of sequence match and probe characteristics
 The combination weights can vary with the generation
 Sequence match between probe and target gene
  Perfect match: 1
  Mismatch: 0
 Probe characteristics considered
  Tm
  (G+C)%
  Self-complementarity (by sequence comparison)

Population
 Parents are selected by roulette-wheel selection
 One-point crossover with probability Pc, mutation with probability Pm
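
The operators listed above can be sketched as follows. The single `weight` parameter, the function names, and the choice of mutation (replacing a start position with a fresh random one) are assumptions; the slides do not specify these details.

```python
import random

def fitness(match_score, probe_score, weight):
    """Linear combination; `weight` in [0, 1] may be changed from generation to generation."""
    return weight * match_score + (1.0 - weight) * probe_score

def roulette_select(population, scores, rng):
    """Pick one individual with probability proportional to its score.

    Scores must be non-negative and not all zero.
    """
    return rng.choices(population, weights=scores, k=1)[0]

def one_point_crossover(a, b, pc, rng):
    """With probability pc, swap the tails of two parents at a random cut point."""
    if len(a) > 1 and rng.random() < pc:
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]

def mutate(chrom, max_starts, pm, rng):
    """With probability pm per position, draw a new start position below max_starts[i]."""
    return [rng.randrange(m) if rng.random() < pm else s
            for s, m in zip(chrom, max_starts)]
```

Here `max_starts[i]` is the number of valid start positions in gene i (gene length minus probe length plus one), so mutated probes always fit inside their gene.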
Optimizing Probe Set


One probe (spot) is designed to detect one gene.
Assumption
 The optimal probe set will contain as few oligonucleotides as possible

Goal
 M probes detect N genes (M is less than or equal to N)
 Find M probes that together detect N genes
Example


3 target genes
….AGCTCTAGGCCCAT….
….CCCATAGGCTCAAG….
….CCTAGGCGCGCTCA….
GA Representation

Probe sequences are selected from target gene sequences
Probe set (size m: number of probes)
 gene id:                            1   8  20  10  .....
 start position of selected region:  3   2  15  90  .....
Each (gene id, start position) pair is one probe (parameter: probe length c)
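
A minimal sketch of this second encoding, in which each probe is a (gene id, start position) pair. The function name and the 0-based indexing of gene ids and start positions are assumptions.

```python
def decode_probe_set(chromosome, genes, probe_len):
    """Map each (gene_id, start) pair to the probe sequence it selects."""
    return [genes[gid][start:start + probe_len] for gid, start in chromosome]

# Toy usage with the three example target fragments and probe length c = 5.
genes = ["AGCTCTAGGCCCAT", "CCCATAGGCTCAAG", "CCTAGGCGCGCTCA"]
probes = decode_probe_set([(0, 4), (1, 0)], genes, 5)
```

Here `probes` is `["CTAGG", "CCCAT"]`. Because the gene id is part of each entry, the number of probes m no longer has to equal the number of genes.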
Sequence Matching


Target and probe
Matching result between target ci and the probe set
 Can be represented as a list or vector, called a fingerprint

 Two clones are distinguishable if their fingerprints differ
 Find a probe set that makes all fingerprints different from each other.
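
The fingerprint idea can be sketched with the three example target fragments. Matching is modeled here as exact substring occurrence, which is an assumption (a real design would score hybridization and allow mismatches); the function names are hypothetical.

```python
def fingerprint(target, probes):
    """Binary vector: 1 where the probe occurs in the target, 0 otherwise."""
    return tuple(int(p in target) for p in probes)

def all_distinguishable(targets, probes):
    """True if every target has a different fingerprint."""
    prints = [fingerprint(t, probes) for t in targets]
    return len(set(prints)) == len(prints)

# Toy usage: two probes are enough to tell the three fragments apart.
targets = ["AGCTCTAGGCCCAT", "CCCATAGGCTCAAG", "CCTAGGCGCGCTCA"]
probes = ["CTAGG", "CCCAT"]
prints = [fingerprint(t, probes) for t in targets]
```

Here `prints` is `[(1, 1), (0, 1), (1, 0)]`, so M = 2 probes distinguish N = 3 targets. With the single probe "CTAGG" the first and third targets share the fingerprint `(1,)` and cannot be separated.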
Issues




Consider the number of mismatches
Consider BLAST search
Design of probe sequences
Can the probe characteristics used as fitness conditions be learned?