Contemporary experimental biology (in particular molecular biology) makes extensive use of data stored in databases. Sabina Leonelli claims that storing data in databases expands the evidential scope (i.e. the range of claims for which something can be taken as evidence) of the data themselves. To be properly understood, this claim should be embedded in the framework provided by Bogen and Woodward in their seminal paper Saving the Phenomena. Bogen and Woodward argue that science is not interested in formulating systematic explanations of data, but rather of phenomena. On this view, phenomena are stable regularities for which data serve as evidence. However, data must somehow be ‘filtered’ to eliminate confounding factors (e.g. spurious or meaningless regularities, noise, etc.) in order to reveal phenomena. How data are processed varies from experiment to experiment, in the sense that every experimental context has its own peculiar confounding factors that complicate the identification of phenomena. Therefore the evidential scope of data is local, in the sense that it depends on the particular way the data are produced. Leonelli challenges this idea by arguing that data, when stored in biological databases, become nonlocal and their evidential scope is enhanced. In this talk, I will start from Leonelli’s point and develop it further into an account of how data stored in biological databases might expand their original evidential scope. The question of this talk is therefore: in what sense is the evidential scope of data sets enhanced once they are re-used? In what follows I shall argue that data stored in databases are not properly data; rather, they display many important features usually ascribed to phenomena and, moreover, they are stored as representing specific phenomena.
By drawing on the distinction between phenomenon types and phenomenon tokens, I shall show that data, when stored in biological databases, are transformed into phenomenon types. By ‘packaging’ data, curators of databases in fact perform an operation of idealization on the data. Data stored in databases can then be used as types of (idealized) phenomena to be compared to other bits of data. The relation between data and databases might therefore be understood as the model/world relation extensively analyzed in the philosophy of science. Specifically, my case in point will be cancer genomics, where data stored in databases are used to eliminate specific confounding factors. In such a ‘data-intensive’ enterprise, computational algorithms might potentially detect several phenomena, but not all phenomena are actually useful for a particular research project. First, I will show exactly how databases are structured in cancer genomics. Then, I shall illustrate how, by using databases such as COSMIC or TCGA, it is possible to identify ‘phenomena’ of interest in large data sets and to eliminate uninteresting ones. In particular, I will draw on a case where databases are used to establish whether certain genes are cancer genes in a specific type of cancer (i.e. lung cancer). By comparing ‘regularities’ detected in a lab to ‘regularities’ stored in databases (which are already characterized as specific phenomena), one can establish whether a regularity is a token of a specific phenomenon type. By being used as one horn of a typical model/world relation, data stored in biological databases expand the range of claims for which they can be taken as evidence.
The Immune Epitope Database (IEDB) has been painstakingly curated from more than 16,000 publications -- nearly all the papers ever published in immunology, allergy, and autoimmune research. An epitope is the part of a molecule that is recognized by the immune system, and the heart of the IEDB is the set of more than 119,000 epitope records, covering positive and negative results from more than 700,000 individual experiments. The value of these costly data is vastly increased when they are linked to other resources, such as GenBank, UniProt, the NCBI Taxonomy, and the Gene Ontology. Every such resource is built on a set of constraints and assumptions, often implicit. Integration invariably means bringing these constraints and assumptions into conflict. In every case we find that external resources contain both too much and not enough data for our needs: too many taxa in the NCBI Taxonomy but too few transgenic mouse strains; too many proteomes in UniProt but not enough information about viral polyprotein cleavage products. We must select and extend the data from the external resources, respecting their constraints while bringing them in line with our own. As we resolve these conflicts we change, and hopefully clarify, the meanings of our data, exposing exceptions and errors that have to be re-curated. Integration on this scale breaks new ground, raising many questions and demanding careful reflection on the science, the data, and their use. In this paper I present my perspective on these practices, both as a philosopher of science and as a member of the IEDB team.
In biomedical research an experimental result can be grounded in the consistency of the methods adopted and in the locality of its production, namely, the experimental conditions. As Jacob remarked, “in biology, any study [...] begins with the choice of a ‘system’. Everything depends on this choice: the range within which the experimenter can move, the character of the questions he is able to ask, and often also the answers he can give” (Jacob 1987). Biological findings thus seem to be strictly dependent on the locality of their production. Generalisation is very problematic in biology, and making claims about biological phenomena beyond the locality of the data is often difficult within traditional approaches. Ontologies seem to overcome such locality (see for instance Leonelli 2009), since they can exploit the knowledge produced in one specific context and make it available to another (even very different) one. To put it differently, ontologies broaden Jacob’s notion of an experimental system to the entire realm of biological knowledge. Put in these terms, however, this could be just an evocative picture. My proposal is to provide an epistemic justification for the unifying power of ontologies. In particular, I will focus on the structure of the Gene Ontology (GO). By examining both the epistemic reasons for its implementation and the type of analysis GO provides, I will show how this tool resembles some features of a map but nevertheless constitutes something new in the epistemological scenario. Not quite a theory, more than a model (but structurally similar to one), GO, I will argue, is a novel category within the epistemic repertoire. I then claim that the knowledge provided by GO should be seen as a more or less effective tool through which we can discriminate, among an enormous amount of data, a convenient way of organising the empirical results on which the GO analysis itself is based.
Accordingly, this specific status will be further specified given that GO is both conventional, being the result of epistemic interactions towards a common agreement, and normative, since the tool shapes the representation of knowledge as it will be perceived by other, future researchers. In conclusion I will suggest that GO is an orienteering tool with which scientists can map their data onto a wider context and then, thanks to this, elaborate new experimental strategies. GO is thus a map for making the conceptual content of a particular experimental condition comparable across different research contexts. Such a map is essential not as a way to confirm experimental results but as a way to compare experimental results with the theoretical background (the so-called ‘big picture’). Lastly, I will address the fact that ontologies are considered a unification tool. In considering the possibility of such a generalisation (beyond the locality of data production), I will show that GO does not create, per se, a unification of theoretical content. My proposal is then to clarify what exactly GO unifies, and how.
“Data flow” is a term commonly used in the context of data-intensive sciences, which have managed to turn the production of scientific data into an automated information stream from the observation or experimentation instrument, to online databases, and into the researcher’s office computer. The flow is, however, only one of several widely used metaphors, which relate in various ways to the image of a data ecosystem (Star and Griesemer, 1989; Baker and Bowker, 2007; Parsons et al., 2011), and which carry ontological implications at the same time. While they are powerful instruments for conceptualising data management and knowledge production, the flow and ecosystem metaphors’ implication that data can move as smoothly, effortlessly, and cohesively as a natural river has also been criticised (Edwards et al., 2011; Leonelli, forthcoming). I contend that an ecosystem’s processuality, adaptability, and complexity conflict with data’s desired robustness and their character as a mineable and quantifiable product. Moreover, the epistemic role of data, and of data management practices in particular, remains largely unclear in light of this blurred ontological image of data.
My paper will take two approaches to elucidating this image. I will first discuss the ambiguous ontic implications of data and data ecosystems with respect to the dichotomy between process ontology and substance or object ontology. These positions serve as two extremes that mark the limits of a field in which conceptions of data management systems and examples of data ecosystems from case studies can be situated and put into relation. My second approach is based on empirical ontological study, taking into account recent developments in science and technology studies that have highlighted the possibility of ontologically differing enactments of entities and their pluralisation depending on the situation (Lynch, 2013). My paper analyses data practices in contemporary ocean sciences with respect to data flows, ecosystems, and their ontological enactments. In contrast to the global infrastructures of the climate sciences (Edwards, 2010), oceanographic data are produced and processed in systems with larger ranges of complexity and automation. Oceanographers not only produce heterogeneous data from remote and often inaccessible areas; they create methods to produce robust and reusable data about ocean phenomena, which are themselves highly complex and processual.
This talk discusses the notion of ‘data model’, its current role in the philosophy of science, and what focusing on these objects can teach philosophers about the complex relationship between data and models. My discussion is grounded in an empirical approach to philosophical analysis, in which the discussion of the epistemic role of data and models rests on a study of how contemporary scientists use these research components to explore the world and reason about it. I start by arguing that data models have often been seen as oversimplified/idealised versions of actual datasets whose main function is to make data useful as evidence for theoretical claims (for instance, in a 2006 review paper Roman Frigg defines them as a “corrected, rectified, regimented and in many instances idealized version of the data we gain from immediate observation, the so-called raw data”); and that capturing the relation between data and models in this way has inhibited close philosophical investigation of the status of data in scientific research and their relation to modeling. Building on ongoing empirical work on how data are circulated and modeled in contemporary plant science, I then reflect on the status of data models, the extent to which they can be viewed as ‘representations’ of a target system, the possible differences between data models and ‘simple’ datasets (and their respective roles as communication and exploration tools within and across scientific communities), and the crucial importance of these tools for the achievement of scientific understanding.
The consensus view on the evolutionary history of mitochondria is that this intracellular structure emerged from endosymbiosis. On this view, mitochondria, once free-living α-proteobacteria, were integrated by another cell and progressively specialized to become an organelle, mostly serving as a “power plant” for the host cell. While this broadly sketched scenario is now widely accepted, there is much disagreement over its details. The object of this talk is to understand the difference in the data used by the two main antagonists in this debate, how these different sets of data result in differing hypotheses, and the consequences this bears for philosophical debates. On one side is the ‘hydrogen hypothesis’, first defended in a 1998 Nature article by William Martin and Miklos Müller. This scenario takes place in an oxygen-deprived (anaerobic) environment, where a hydrogen-consuming archaebacterium took advantage of a hydrogen-producing α-proteobacterium and eventually integrated it within its cytoplasm. Subsequent gene transfer between host and symbiont increased the host’s metabolic versatility and allowed this association to thrive in oxygen-containing (aerobic) environments. The acquisition of the mitochondrion in this scenario is seen as key to the formation of eukaryotes and to the subsequent increase in complexity observed in these organisms. The other camp’s main protagonist is Thomas Cavalier-Smith, who has been elaborating his ‘phagotrophic hypothesis’ since 1975. The phagotrophic hypothesis argues that the key event in the acquisition of mitochondria, as well as in the development of eukaryotes, was the acquisition of a system of internal membranes crucial to the evolution of phagocytosis. Phagocytosis made possible the integration of an α-proteobacterium, which was progressively enslaved to become a mitochondrion.
The increase in energy efficiency provided by the presence of the mitochondrion is seen as one step among others in the evolution of eukaryotes, secondary in importance to the membrane innovations that enabled the integration of the different organelles and the nucleus. I will argue that the former hypothesis restricts the scope of its data to genomic data, despite being secondarily supported by metabolic constraints. The second hypothesis, by contrast, is based on more varied sources of data; here genomic data are treated more as secondary sources, fitted into a model built from cell biology and fossil-record data. The philosophical assumptions behind the ‘hydrogen hypothesis’ matter for recent debates in the philosophy of biology, chiefly the contestation of the Tree of Life hypothesis, with many voices now arguing that evolutionary relations between species are of a network nature. We will see that this contestation is also grounded in a similar genome-centred perspective that leads other kinds of data to be ignored. With this talk, I therefore aim to illustrate how the usage of data can shape philosophical discussions, and to assess the impact that a more diversified approach, like Cavalier-Smith’s, can have.
How effectively communities of scientists come together and co-operate is crucial both to the quality of research outputs and to the extent to which such outputs integrate insights, data and methods from a variety of fields, laboratories and locations around the globe. This paper focuses on the ensemble of material and social conditions, within which organismal research is situated, that makes it possible for a short-term collaboration, set up to accomplish a specific task, to give rise to relatively stable communities of researchers. We refer to these distinctive features as repertoires, and we investigate their development and implementation in a key case study in the contemporary biological sciences, namely how research on individual organisms evolved into model organism communities. We focus particularly on the ways in which the epistemic value of data as evidence is shaped by the features of the materials via which the data have been generated, as well as by the ready availability of these materials. This is a typically overlooked aspect of data epistemology, affecting both how data are encoded in databases and how their provenance is portrayed and interpreted when data are re-used, integrated and/or developed in further research. We conclude that whether a particular project ends up fostering the emergence of a resilient research community is partly determined by the degree of attention and care devoted by researchers to material and social elements beyond the specific research questions under consideration.