Christian Stratowa

Distributed Storage and Analysis of Microarray Data in the Terabyte Range: An Alternative to BioConductor
**************************************************************************

Novel high-throughput technologies such as DNA microarray analysis allow biologists to generate data sets in the terabyte range. Many of these data will be deposited in the public domain, which necessitates a common standard. Currently available database systems are not well suited to this purpose. In this paper, I will introduce an object-oriented framework that has been specially designed for distributed data warehousing and data mining of data in the terabyte range. Data are stored as sets of objects in machine-independent files, and specialized storage methods provide direct access to individual attributes of selected data objects. The system has been designed so that it can query its databases in parallel on SMP/MPP machines, on clusters of PCs, or using common GRID services. To demonstrate the applicability of this framework to microarray data, I will present a functional prototype system called XPS (eXpression Profiling System), which can be considered an alternative to the BioConductor project. The current implementation handles efficient storage of Affymetrix GeneChip schemes, gene annotation and data, as well as the pre-processing, normalization and filtering of GeneChip data. Based on this system, I will propose a novel standard for the distributed storage of microarray data. Finally, I will emphasize the similarities between R and this framework, and show how R could easily be extended to access the described data analysis framework.