Oligonucleotide microarrays consist of small DNA fragments (probes)
chemically synthesized at specific locations (spots) of a solid surface.
Probes are typically 25 to 60 nucleotides long and are synthesized in parallel, in a
series of repetitive steps. Each step appends a particular nucleotide to
selected regions of the chip. Selection occurs by exposure to light with the
help of a photolithographic mask or micromirror arrays.
Regardless of which method is used to direct light (masks or micromirror arrays), it is possible that some probes are accidentally activated for chemical coupling because of light diffraction, scattering or internal reflection on the chip surface. This unwanted illumination of regions introduces unexpected nucleotides that change the probe sequences, significantly reducing their chances of successful hybridization with their targets, and introducing cross-hybridizations. This problem is more likely to occur near the borders between a masked and an unmasked spot (in the case of maskless synthesis, between a spot that is receiving light and a spot that is not); hence the term border conflict. By carefully designing the arrangement of the probes on the chip and their embeddings (the sequences of masked and unmasked steps used to synthesize each probe), it is possible to reduce the risk of unintended illumination. This issue becomes even more important as more probes need to be accommodated on a single chip, which require the production of spots at higher densities with reduced distances between the probes. Our aim is to design the layout of a microarray in such a way that we minimize the incidence of the unintended illumination problem, what we call the microarray layout problem (MLP), using the term layout to refer to where and how the probes are synthesized on the array, i.e., their arrangement and their embeddings.The MLP appears to be hard because of the super-exponential number of possible arrangements, although no NP-hardness proof is yet known. Optimal solutions are thus unlikely to be found even for small chips and even if we assume that all probes have a single predefined embedding. If we consider all possible embeddings (up to several million for a typical probe), the MLP is even harder. For this reason, the problem has been traditionally tackled in two phases. First, an initial embedding of the probes is fixed and an arrangement of these embeddings on the chip with minimum conflicts is sought. This is usually referred to as the placement phase. Second, a post-placement optimization phase re-embeds the probes considering their location on the chip, in such a way that the conflicts with neighboring spots are further reduced. Often, the chip is partitioned into smaller sub-regions before the placement phase in order to reduce running times, especially on larger chips. Conflict Index ModelThe Border Length Minimization Problem [6] was the first formal definition of the problem of unintended illumination during the production of microarrays. The border length model evaluates the quality of a given mask. We have developed a more general model, called conflict index [1-5], that estimates the risk of synthesizing a particular probe incorrectly. In our model we also take into account two practical considerations:
Greedy+ and GeneChip ArraysWe have recently developed the first placement algorithm, called Greedy+, that for the first time combines the placement and embedding phases [1]. In [1], we also present the first large-scale layout-quality study on several old and recent GeneChip arrays, using Greedy+ to propose alternative layouts with reduced conflicts. Our analysis is based on the specification obtained from Affymetrix's web site containing the list of probe sequences and their positions on the chip.Original LayoutsMost GeneChip arrays for gene expression analysis, such as the Wheat Genome array, are synthesized in 74 steps, whereas the most recent genotyping arrays tend to use a 80-step sequence. On many GeneChip arrays, the probes appear in pairs: the perfect match (PM), which perfectly matches its target sequence, and the mismatch (MM) probe, which is used to quantify cross-hybridizations and unpredictable background signal variations. The MM probe is a copy of the PM probe except for the middle base (position 13 of the 25-mer), which is exchanged with its Watson-Crick complement. The layout of these arrays alternates rows of PM probes with rows of MM probes in such a way that the probes of a pair are always adjacent on the chip. Moreover, PM and MM probes are usually pair-wise right-most embedded. Informally, a pair-wise right-most embedding is obtained from right-most embeddings by shifting the first half of one embedding to the left until the two embeddings are "aligned" in the synthesis steps that precede the mismatched middle bases. This approach reduces border conflicts between the probes of a pair, but it leaves a conflict in the steps that add the middle bases.
Both SNP arrays have the peculiarity that all SNP probes are replicated in four identical copies that are guaranteed to be physically separated on the chip with a minimum distance of 1mm. This is achieved by partitioning the chip into four quadrants that are separated by a 1mm-thick central cross, which mainly contains copy number (CN) probes and control probes. Each copy of a SNP probe is placed on a different quadrant and, thus, each quadrant can be designed independently. Alternative LayoutsWe used our new algorithm Greedy+ [1] with different parameters $Q$ and Sequential re-embedding to
create alternative layouts for the Wheat Genome array and for the top-left quadrants of both Human
SNP arrays. For each chip we separately run both BLM and CIM versions of the algorithms. The main
difference between our layouts and the original ones is that we do not require the arrays to
alternate rows of PM and MM probes, nor do we place SNP probe pairs on adjacent spots. This is
especially helpful for CIM since it avoids conflicts in the middle bases. With BLM, we observe that
Greedy+ places between 90.7% and 91.5% of the probe pairs of the Wheat array on adjacent spots.
With CIM, this rate drops to between 13.0% and 17.6%.
The figure shows the NBL for each masking step of the layout produced by Greedy+ for the top-left quadrant of the SNP 6.0 array in comparison with the original Affymetrix layout. It can be seen that the CIM variant of our algorithm, using variable instead of right-most embeddings, is able to distribute the addition of middle bases over a larger range of synthesis steps (masks 20-60, approximately) and significantly reduce the incidence of border conflicts in these steps, where conflicts are expensive. Pivot PartitioningWe have developed a partitioning algorithm, called Pivot Partitioning [3], that combines the partitioning of the chip with the embedding of the probes. The main advantages of our approach over previous methods are: faster and better selection of pivots used to drive the assignment of probes to sub-regions; and improved assignment of probes to regions by considering all valid embeddings of a probe. Our algorithm outperforms the previous partitioning algorithms by over 8%.
Microarray Layout as a Quadratic Assignment ProblemWe have recently showed that the Microarray Placement Problem is an instance of the Quadratic Assignment Problem (QAP) [8,9]. The QAP is a classical combinatorial optimization problem that is, in general, NP-hard, and particularly hard to solve in practice. Here are some QAP instances derived from our models and the best solutions computed with QAP solvers. The problem instances were derived from small artificial microarrays containing between 36 and 144 probes. These probes must be assigned to specific locations of the chip (spots), whose dimensions range from 6x6 to 12x12. Once the assignment is made, there are two models that evaluate the quality of the layout: border length and conflict index (these models are somewhat correlated, i.e. a good layout has both low border length and low conflict indices). In our formulations, the flow matrix contains the Euclidean distance between the spots, while the distance matrix contains the (weighted) Hamming distance between the probe embeddings. The chips available here are square in shape and thus, for every solution, there are another seven symmetrical solutions. Because of the large number of probes on industrial microarrays (up to 1.3 million probes), it is not feasible to use any QAP method to design an entire microarray chip. However, it is certainly possible to design or improve small sub-regions of a chip. Since the instances that we work on are only a small part of the whole problem, we are more interested in methods that can solve a QAP rather quickly (in a few minutes). Nonetheless, we also report here results obtained with other time-consuming approaches.
Note that the results that appeared in [5] are averages over three runs on a set of ten random files uniformely generated (the files available here are only the first file of each set).
File FormatsProblems and solutions are available as plain text files with the formats used by QAPLIB. The format of the problem file is:
A B where n is the size of the instance, A and B are flow and distance matrices, respectively. The format of the solution file is:
p where n gives the size of the instance, sol is the objective function value and p is the corresponding permutation. Problem Instances and SolutionsThe results obtained with GRASP with Path Relinking (GRASP-PR) [3] used default parameters (32 iteractions, alpha = beta = 0.5) and finished in less than 5 minutes each. The solutions found with RTL-1 [4] and RTL-2 [5] were reported by Dr. Peter Hahn, with running times ranging from 6.7 to 29.2 hours on a Dell 7150 733 MHz Itanium CPU. Chris MacPhee reported the best solutions so far (obtained by a QAP solver), using GATS [6], a hybrid genetic / tabu search algorithm, with running times ranging from 84 seconds to 479 hours. This page will be updated when we hear about better solutions. To download all problem instances and GRASP-PR solutions, click here. Border Length Minimization
Conflict Index Minimization
ContactFor more information, please contact Sven Rahmann. References[1] S. A. de Carvalho Jr. and S. Rahmann. Improving the design of GeneChip arrays by combining placement and embedding. Proceedings of the 6th Annual International Conference on Computational System Bioinformatics (CSB), 2007. Preview (PDF). [2] S. A. de Carvalho Jr. and S. Rahmann. Modeling and optimizing
oligonucleotide microarray layout. In Ion Mandoiu and Alexander Zelikovsky,
editors, Bioinformatics Algorithms:
Techniques and Applications, Wiley Book Series on Bioinformatics. Wiley. [3] S. A. de Carvalho Jr. and S. Rahmann. Improving the layout of oligonucleotide microarrays: Pivot Partitioning. In P. Bucher et al., editors, Proceedings of the 6th Workshop of algorithms in Bioinformatics (WABI), volume 4175 of Lecture Notes in Computer Science, pages 321-332. Springer, 2006. Preview (PDF). [4] S. A. de Carvalho Jr. and S. Rahmann. Microarray layout as a quadratic assignment problem. In D. Hudson et al., editors, Proceedings of the German Conference on Bioinformatics (GCB), volume P-83 of Lecture Notes in Informatics, pages 11-20. Gesellschaft für Informatik, 2006. Preview (PDF). [5] S. A. de Carvalho Jr. and S. Rahmann. Microarray layout and the quadratic assignment problem (poster). 14th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), Fortaleza, Brazil, 2006. Preview (PDF). [6] S. Hannenhalli, E. Hubell, R. Lipshutz, and P. A. Pevzner. Combinatorial algorithms for design of DNA arrays. Advances in Biochemical Engineering Biotechnology, 77:1-19, 2002. [7] Oliveira, C.A.S., Pardalos, P.M. and Resende, M.G.C. (2004): GRASP with Path-relinking for the Quadratic Assignment Problem. In Ribeiro, C.C. and Martins, S.L. (eds.), Efficient and Experimental Algorithms, LNCS, 3059, 356-368, Springer-Verlag. [8] Hahn, P., Grant, T. and Hall, N. (1998): A Branch-and-bound Algorithm for the Quadratic Assignment Problem Based on the Hungarian Method. European Journal of Operational Research, 108, 629-640. [9] Adams, W., Guignard, M., Hahn, P. and Hightower, W.: A Level-2 Reformulation Linearization Technique Lower Bound for the Quadratic Assignment Problem. To appear in the European Journal of Operational Research. [10] Rodriguez, J.M., MacPhee, F.C., Bonham, D.J., Horton, J.D. and Bhavsar, V.C. (2004): Best Permutations for the Dynamic Plant Layout Problem. In Dasgupta, A.R., Iyengar, S.S., and Bhatt, H.S. (Eds.): Proceedings of the 12th International Conference on Advances in Computing and Communications (ADCOM 2004), Allied Publishers Pvt. Ltd., New Delhi Ahmedabad, India, 15-18 Dec., ISBN 81-7764-717-2, pp.173-178. |

Oligonucleotide microarrays consist of small DNA fragments (probes)
chemically synthesized at specific locations (spots) of a solid surface.
Probes are typically 25 to 60 nucleotides long and are synthesized in parallel, in a
series of repetitive steps. Each step appends a particular nucleotide to
selected regions of the chip. Selection occurs by exposure to light with the
help of a photolithographic mask or micromirror arrays.
The latest Genome-wide Human SNP arrays do not contain PM/MM probe pairs. Most of the probes on
these arrays, however, are also designed in pairs to target the two possible alleles of a given SNP
position. These probes are identical except for one nucleotide. Unlike PM/MM probes, the mismatched
base is not necessarily its Watson-Crick complement, and the mismatch position can be anywhere
between bases 9 and 17, inclusive, of the 25-mer. On the Human SNP 6.0 array, one of the latest
GeneChip arrays, the probes of a pair are located on adjacent positions on the chip and are
pair-wise right-most embedded. In contrast, on the Human SNP 5.0 array, all probes were tiled as
single probes, i.e., they do not necessarily occupy adjacent positions on the chip, and they are
right-most embedded (but not pair-wise aligned). The figure shows the NBL for each masking step of three commercially available GeneChip arrays: Wheat
Genome, Human SNP 5.0 and Human SNP 6.0. An analysis of their layouts suggests three different
placement strategies.
used our new algorithm Greedy+ [1] with different parameters $Q$ and Sequential re-embedding to
create alternative layouts for the Wheat Genome array and for the top-left quadrants of both Human
SNP arrays. For each chip we separately run both BLM and CIM versions of the algorithms. The main
difference between our layouts and the original ones is that we do not require the arrays to
alternate rows of PM and MM probes, nor do we place SNP probe pairs on adjacent spots. This is
especially helpful for CIM since it avoids conflicts in the middle bases. With BLM, we observe that
Greedy+ places between 90.7% and 91.5% of the probe pairs of the Wheat array on adjacent spots.
With CIM, this rate drops to between 13.0% and 17.6%.