Association canadienne des utilisateurs de données publiques/
Canadian Association of Public Data Users (CAPDU)
Submission to the National Data Archive Consultation Working Group
17/11/2000
Summary
The Association canadienne des utilisateurs de données publiques/Canadian Association of Public Data Users (CAPDU) strongly supports the objectives of the National Data Archive Consultation (NDAC), i.e. to explore the need for a national data archive.
In response to the specific issues being addressed in the first round of the NDAC consultations, CAPDU's submission argues as follows:
1. In a period when other countries, including newly emerging former Eastern bloc countries, are building national data archives, Canada is the only country to have had one and closed it down.
2. The reasons for having a national data archive are numerous and compelling, especially in an information-based economy. Data is to information as iron ore is to steel.
3. Currently, Canada has no institution with the mandate and the resources to adequately provide data archiving for even one, much less all three, of the traditionally data-producing sectors: government, academia, and the commercial sector.
4. Sweeping changes will need to be made to policies, and a substantial allocation of resources will be needed to provide adequate archival capability of the scope needed to provide for current requirements.
5. Depending on the mandate and resources of a national data archive, substantial benefits are predicted to accrue to all stakeholders, including data producers in all sectors, government departments, policy makers, research funding agencies, researchers, students, the commercial sector, and ultimately the public at large.
6. There are a number of areas in which there will be direct, observable benefits to Canada's research capacity arising from effective and cohesive data management, preservation, and data access:
- Canadian capacity to study change over time of especially social, economic, and medical phenomena will benefit from the salvage of key historic data sets, as well as improved and coordinated access to contemporary data files of scattered origins,
- Canadian capacity for comparative research over geographic space will be enhanced not only by improved access to Canadian data, but also by the leverage such access will give Canada when negotiating international data exchange agreements,
- Canadian capacity for analysis of research data will be vastly enhanced by a cohesive collection of research data upon which to build new research and statistical techniques tailored to the unique conditions in Canada,
- Canadian capacity for research will be vastly improved by the development of a research cadre in all economic sectors that is knowledgeable about Canadian data,
- Canadian competitiveness will be improved through international collaboration and increased knowledge generation.
7. It is our submission that the infrastructure required to provide long-term preservation of significant research data in Canada requires:
- an institutional mandate that includes archiving of data collected by government, academia, and commercial sectors,
- an institutional mandate to disseminate or provide access to research data in its collections to ensure that they are most efficiently used for secondary analysis,
- an institutional mandate that recognizes the current and future potential for cross-disciplinary research encompassing many kinds and sources of data,
- an institutional mandate that complements those of the National Archives and National Library so as to maximize the preservation of research and information resources in Canada,
- an institutional mandate to lobby government at all levels, and the academic sector for needed revisions to legislation as well as policy,
- an institutional mandate empowering the national data archive to negotiate for Canada in data-related international policy, standards, and data exchange activities and initiatives,
- the resources to provide the level of activity needed to preserve research data (in all disciplines) in perpetuity.
Introduction
The Association canadienne des utilisateurs de données publiques/Canadian Association of Public Data Users (CAPDU) strongly supports the objectives of the National Data Archive Consultation on research data archiving and access (NDAC), i.e. to explore the need for a national data archive. The lack of a national data archive was correctly diagnosed in the mid-1970s and led to the implementation of the Machine Readable Archives (MRA) Division of what is now the National Archives of Canada. The continuing lack of such a facility has been a subject of much concern since the demise of that division in the mid-1980s.
First, it is important to clarify what materials we consider the proper object for collection by a national data archive.
There is a model that defines a useful continuum for this purpose as follows: from data is derived information, from information is derived knowledge, and from knowledge is derived wisdom. To illustrate: the individual responses given by respondents on the 1996 census of population questionnaire are data (throughout this submission, 'research data'). The descriptive statistics produced from these data, which determine that 70.25% of those 85 years of age or over are female (236,800 women 85+ as opposed to 100,280 men 85+), constitute information. The correlation of this information with information on the prevalence of certain diseases among elderly women, to predict resource demands on the health systems once the post-war baby boom begins to approach the age of 85, constitutes knowledge, and the correct translation of that knowledge into appropriate resource allocation within the health sectors 30 years from now constitutes wisdom.
The collections of traditional libraries are concentrated at the 'information' and 'knowledge' levels in the above continuum. Traditional archives too tend to collect at the level of information products, e.g. the information products on which policies are based. A national data archive, like existing data archives and services, derives its collections, however, at the first level, the 'raw data' level. That is to say, it collects data at the level at which they were originally collected, whether by means of a survey of individuals, or Canada Customs individual records for each load of goods passing the Canadian border (transaction level), or the original digitized vectors comprising a map file, before subject matter, colours, titles, coded symbols, etc. have been added to it. This raw, pre-digested or pre-summarized format is the single most useful format for all research data, and is the format from which all other derived products flow. Once data have been summarized, or post-processed in some way, it is usually impossible to recapture the original raw data.
Phase 1 of the NDAC is directed to address four main issues:
1. To what extent is there a need for a unified and coordinated data archiving function? Are modest changes to existing institutional policies and mechanisms adequate to meet current and future requirements?
2. What gaps exist in the mandates and structures of existing institutions in relation to management of research data?
3. Who will benefit from the improved management of research data and to what degree?
4. How will effective research data management, preservation and access contribute to Canadian research capacity?
This submission seeks to address each of these questions, in turn.
Part I: To what extent is there a need for a unified and coordinated data archiving function? Are modest changes to existing institutional policies and mechanisms adequate to meet current and future requirements?
I.1. Within the past twenty years, more than 25 countries around the world have created a national data archive or equivalent institution with a mandate that addresses the need to collect, manage, and preserve research data. These countries joined the 6 countries that had already created such institutions in the 1960s or 1970s. These developments are the outgrowth of parallel efforts through the 1960s and 1970s under the auspices of UNESCO, the International Social Science Council, and the National Science Foundation (U.S.) to promote the organizational and informational infrastructure for comparative social science research. Similar developments have occurred in the natural and physical sciences, under the aegis of CODATA, and more recently the International Global Change Programme.
In this period, only Canada is known to have closed down its national data archive equivalent (the MRA).
Among the more than 30 national data archives or equivalent institutions around the world, mandates differ. Some have a mandate to collect only data produced in the academic or commercial sectors; others are mandated to collect only data produced by and for government. Some national data archives are mandated to collect research data in the social sciences, others in the humanities, some in the natural or physical sciences, and yet others derive their mandate from a broad spectrum of disciplines. Several countries have more than one national data archive or equivalent, each with an area of specialization, such as social sciences vs. physical sciences vs. humanities, quantitative data vs. qualitative data, or government sector data vs. academic and commercial sector data.
All of these countries have recognized the need for a national institution, appropriately mandated and funded, to provide for the long-term preservation of research data. The Economic and Social Research Council (ESRC) in the United Kingdom is currently conducting a review of its role vis-a-vis ESRC-funded data depositories and the deployment of ESRC funds in relation to data acquisition and provision. To this end, a Green Paper on data policy and data archiving (<http://www.ilrt.bris.ac.uk/ubris/esrc/p3.html>) was produced this year, to focus the discussion in the areas the ESRC seeks to review. Throughout the Green Paper, there is a clearly expressed understanding of and support for the crucial role of management and preservation of research data.
I.2. Why is the need to archive research data seen as so important?
There are a number of reasons why the preservation of research data, as much in other disciplines as in the social sciences, is of crucial importance.
The ESRC Green Paper on data policy and data archiving (ibid.) succinctly identifies the most significant of these reasons:
"Data are often collected at significant expense, involving the use of substantial expertise and respondents' time and effort. Use of existing data and encouraging secondary analysis can avoid the need for costly primary data generation in some cases. Initial research projects, moreover, frequently exploit only a small part of the full potential of the data which they generate. Archived data can also provide the basis for comparative research between locations or case studies and over time - including historical study. It allows researchers to revisit and look again at earlier research findings for purposes of verification or elaboration. It can also provide methodological insights and can be a valuable resource for teaching purposes. To meet these needs, data needs to be preserved in a form which will ensure their future availability and usability."
I.2.1 The first of these reasons concerns the often significant expense of collecting and recording data and making them available for analysis, as well as the problems arising from the respondent burden phenomenon.
A small survey, in today's economy, can cost in the order of Can$50,000 to conduct, and 10 to 20 such surveys are conducted in the academic sector alone in Canada each quarter. Larger surveys are correspondingly more expensive. On average every four years, Canada conducts an election survey which costs between one-third and one-half of one million dollars to conduct (the 1997 survey cost approximately $420,000).
Given the significant resources expended in the collection of these data files, it is important that they be thoroughly exploited. Often, those who fund research are content to receive, as a product of that research, nothing more than a report of the major findings. This is analogous to contracting with a builder to build a house, and at the end of the construction process, accepting as a product a photograph of the house, and allowing the builder to keep ownership of the house itself. Given the resource investment in the collection of any research data, considerably greater use should be made of them.
Although not all research data are collected from human subjects or via the interview process, much research data, especially in the social and medical sciences, is of this nature. More and more people in Canada are finding themselves the target of survey interviews, from government, the academic sector, and the commercial sector. There is a well-documented and growing reluctance to participate in surveys, with attendant response bias problems, because of the increase in surveying, as well as deep concerns about individual privacy and confidentiality. One of the means of reducing respondent burden is a greater emphasis on the secondary analysis of existing data files.
The secondary analysis of research data (i.e. analysis of data collected for one research purpose, in the examination of other research questions) both maximizes the efficiency of the original data collection, and helps to reduce respondent burden.
I.2.2 The second reason identified by the ESRC is the tendency of most data collection activities to collect more data than is immediately required. Consequently, much of the research potential of any data file is not exploited by the original researcher directing the collection.
For example, the 1985 General social survey on health status and social support collected a total of 428 variables. In 1987, Statistics Canada published one report containing descriptive statistics. Its 87 tables showed the relationship between some of the key variables in the data file and standard demographic variables such as age and sex, and in a few cases, education or household income. Together they represent descriptive statistics on less than one-half of the variables in the original data set. Within 10 years, however, an additional 48 known Ph.D. and Master's theses, conference papers, periodical articles, etc., had been produced, analyzing the 1985 data, not merely producing descriptive statistics from them, but utilizing those data to explore the significance of relationships between various phenomena. If the 'efficiency' of a data file can be said to be measured in terms of the number of publications it produces, this represents a significant increase in the research conducted from this one survey.
The 1985 GSS survey data were part of a pre-Data Liberation Initiative (DLI) consortium acquisition by 25 universities in Canada. From the beginning, therefore, the file has been 'pushed' at the major research universities, and later, as part of the DLI collection, it has been widely available as part of a cohesive and widely known collection of Canadian research data. Its usage pattern is illustrative of the kind of usage that a research data file that is accessible, well documented, and well managed as well as preserved, can be expected to have.
Nor do data yield up all their research potential after a single analysis. It is important to retain the possibility of re-manipulation of data as the emphasis of research and policy issues changes over time. Previously overlooked elements in a research data file become re-evaluated, and new statistical techniques and means of evaluating research data can lead to important re-analyses of previously analyzed data.
I.2.3 Yet another reason given by the ESRC is the capacity for comparative research, over space and over time, which is provided by access to archived data.
Further to the example used in I.2.2 above, additional surveys that focused on family and social support issues were conducted by Statistics Canada in 1990, 1995 and 1996 in the same GSS series. The 1990 survey had, by 1996, generated over 60 Ph.D. and Master's theses, conference papers, periodical articles, etc. With the addition of these parallel surveys, it becomes possible to study the changes in family networks and in social support issues (a crucial research area given the aging population) over a period of more than 10 years. The result is the added research dimension of change over time, which is not possible when previous surveys are not preserved.
Canadian election surveys have been conducted since 1965, the latest being held as this is written. This represents a span of over 35 years over which change in the behaviour of the Canadian electorate can be explored. Political science is one of the disciplines which has long been active in the analysis of change over time, largely because of the availability of extensive collections of comparable election surveys in several countries (e.g. the United States, Sweden, etc.) that span time periods of upwards of 50 years, as well as the culture of data sharing that has carried on through successive Canadian election survey teams since the beginning.
Surveys on the health of Canadians are available from the 1978/79 Canada health survey to the latest version of the National population health survey, a period of over 20 years over which the change in certain characteristics and behaviours can be studied. Had the Canadian illness survey data, collected in 1950, still been available, that time span would have been 50 years, a period covering the economic maturation of the parents of the post-war baby boom, as well as almost the entire childhood and maturation of the post-war baby boom generation.
Aggregate Canadian demographic data from the Census of population describing very small geographic areas (enumeration areas) are now available from 1961 on. With the release of small-area data from each successive census, the use of data from previous censuses burgeons, as the time-span for comparative research capability increases. Without this long-term capacity for analysis of small area data, research on, for example, the migration of elderly populations out of the cities and into rural areas, with the attendant impact on social support services in those areas, would not have been possible.
Of course, if older data files are no longer available, or useable, then trend analysis, or the analysis of change over time becomes impossible.
I.2.4 The ESRC green paper [ibid.] also correctly identifies the issues of replicability and verification of research findings.
One of the cornerstones of academic research, as well as of government credibility, is the ethical probity of researchers. However, data fraud occurs, as has been documented in the academic sector.
With appropriate archival requirements and provisions for data preservation, it becomes much more difficult to perpetrate fraudulent research, and articles such as that by Paul Kaihla entitled 'Academe on trial' (Maclean's magazine 107(51):42-46, December 19, 1994) need never be written. Such negative exposure damages the academic sector and its ability to provide meaningful input to social and policy debates.
The ability to access original research data, so as to replicate previous analyses and verify the research conclusions, enhances the credibility of both data producers and researchers. Further, Karl W. Deutsch argues that "The tests for mutual consistency [i.e. replication analysis] do not preclude new discoveries. On the contrary, they make new discoveries eventually more likely." (Deutsch, Karl W. The impact of complex data bases on the social sciences. In: Data bases, computers, and the social sciences. Ed. by Ralph L. Bisco. New York: Wiley, 1970).
I.2.5 The final issues identified by ESRC are the use of data to provide methodological insights as well as for teaching purposes.
At those academic institutions which have established data services, in conjunction with the increased availability of older data files from Statistics Canada (through e.g. the Data Liberation Initiative), the trend in academic teaching is increasingly toward more sophisticated analytic techniques, and the development of advanced courses on analysis of complex data, such as longitudinal and panel data, geospatial analysis (GIS), etc. In the long term, this will lead to enhanced researcher capabilities, better quality research, an increased capacity for policy analysis, and a more informed population.
Further, improvements in survey methodology and sampling techniques have been fueled by the analyses made possible of data collected by a variety of methodologies and sampling frames over time.
I.3 Are modest changes to existing institutional policies and mechanisms adequate to meet current and future requirements?
Because a response to this question flows more naturally from the second question, this issue will be dealt with later in this submission.
Part II: What gaps are there in the infrastructure?
From the CAPDU perspective, serious infrastructure gaps exist in three areas: the institutional infrastructure, the policy infrastructure, and the culture of research. Each area will be discussed in turn.
II.1 Institutional gaps
II.1.1. Data are being lost continuously in all three sectors: government, academia, and the commercial sector.
Government sector
The Longitudinal study of immigrants, 1976 wave, was collected by Employment and Immigration Canada and was intended to provide the basis for a systematic comparison of the work experiences of men and women, and of immigrants from different countries and with different characteristics. It is effectively lost, since the originating Department no longer has a copy, and the only known copy, in the National Archives of Canada, is no longer useable. Similarly, a number of data files are, for a variety of reasons, no longer available from Statistics Canada, such as the enumeration area level map files from the 1986 census of population, the Smoking habits of Canadians surveys from the early 1970s, and the 1991 follow-up survey of 1986 university and college graduates. Statistics Canada, although the single largest data producer in the social science disciplines in Canada, has never received funding to provide for the long-term preservation and archiving of the research data it collects, and therefore has no preservation program for these data. Those historic research data files that are still available from Statistics Canada are available more through serendipity than design.
The situation is similar in almost all federal and provincial government departments. Funding for archival preservation of research data is not available in the current economic climate. The only federal government departments known to actively archive research data are the Canada Centre for Remote Sensing (CCRS), the Information Systems Directorate of Health Canada, the Marine Environment Data Service of Fisheries and Oceans Canada, the Geological Survey of Canada, and Environment Canada. Although all these departments are providing some form of long-term preservation of data, little is known of what, if any, standards and policies regarding formats, media, and metadata are followed in each case. Common to all is that their mandates are very restrictive: none has a mandate or the resources to serve as a national data archive.
Academic sector
There is no single recognized data archive for research data collected in the academic sector, nor any body empowered to liaise with all three Research Councils (SSHRC, MRC and NSERC), as well as with the Research Offices of the universities in Canada, to ensure that data collected with academic research funds are made available for secondary analysis for the long term.
Most of the academic sector data services in Canada have been created in the course of the past five years. All are small, with a mandate to serve their own local research community, or, in a few cases, the research communities of a group of related local institutions. All have insufficient resources to provide the labour-intensive work necessary for long-term preservation of data files. For example, the University of Toronto Data Library Service has a backlog of between 350 and 500 original research data files awaiting processing, and insufficient resources (hardware or personnel) to actually process these files. This collection may include a copy of the Canadian illness survey of 1950, but it may no longer be possible to salvage that data file. Currently, the available resources are needed entirely to meet the day-to-day demands of local data users.
Commercial sector
In the commercial sector, there is also no recognized body to which data files produced by the commercial sector, and usable for secondary analysis, can be sent. The commercial sector collects significant amounts of research data, such as public opinion polls, market research data, digital aerial photos, etc. Most of these products are regarded as commercial products, and when they cease to have commercial value, they cease to be maintained. For example, all Canadian Institute of Public Opinion (Canadian Gallup) polls prior to number 142 (conducted in May of 1945) and most prior to number 227 (May 1953) are known to be lost due to the lack of an archiving facility. A similar fate is likely to befall current public opinion polls, since only a few of the polling companies currently have the foresight to deposit some of their data files with an institution dedicated to long-term preservation (such as the Roper Center in the United States).
These examples illustrate the need for a central institution with the mandate and cachet to negotiate with government, with major academic research funding bodies, and with the commercial sector. There is equally a need for a central institution to ensure cohesion in preservation, coordinate initiatives, set standards for management and preservation of research data, ensure quality control, salvage research data in imminent danger of being lost, and cope with the new demands of the future.
II.1.2 Major changes in technology (computing hardware and software) occur with increasing frequency and exacerbate problems of preservation. Because data files are recorded on a physical medium, and increasingly, in today's technological environment, in some software-dependent format, data files are subject to loss due to technological obsolescence caused by changes in either the hardware or the software available at any given time.
Long-term preservation of computer-readable research data consists of active (rather than passive) management of three components: the data themselves, the physical media on which they are stored, and the metadata describing them. This includes a continuous awareness of the impact of technological change to ensure that the data are migrated in an accurate and timely way to new formats as they are developed. Lack of rigorous management in any of these areas will result in the irretrievable loss of a data file.
For example, the only known extant copy of the Longitudinal study of immigrants, 1976 wave (see II.1.1 above), is held by the National Archives of Canada. However, although the file is physically readable, it is in a software-dependent format readable only by a 15+ year old version of SPSS, and is therefore not usable today. It is effectively lost. Software-dependent formats have a very short shelf life (< 5 years).
The Canadian illness survey, 1950 was stored on 7-track tapes, the state-of-the-art storage medium at the time. These tapes have not been stored under appropriate physical conditions, and are in all likelihood no longer readable, that is, if one can locate a tape drive capable of reading 7-track tapes. The physical media on which computer-readable data are stored have a relatively short shelf-life (about 5-10 years).
For some of the early Canadian Gallup polls, the data themselves have survived, but no questionnaire (metadata) has survived. Without a column by column guide to the variables in the data file and how each variable is coded, the data are undecipherable. Thus, given the short shelf-life of physical storage media, and the even shorter shelf-life of software-dependent formats, the window of opportunity within which research data can be archived is quite small.
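The dependence of a data file on its column by column guide can be illustrated with a minimal sketch. The record layout, variable names, and codes below are hypothetical, invented purely for illustration; they are not taken from any actual Gallup poll or other data file mentioned in this submission.

```python
# Hypothetical codebook: variable name -> (start column, end column, value labels).
# Columns are 1-indexed and inclusive, as in a typical printed codebook.
CODEBOOK = {
    "region": (1, 1, {"1": "Atlantic", "2": "Quebec", "3": "Ontario", "4": "West"}),
    "sex": (2, 2, {"1": "Male", "2": "Female"}),
    "age": (3, 4, None),  # numeric variable, no value labels
}


def decode_record(record, codebook):
    """Turn one fixed-width line into a dict of labelled values."""
    decoded = {}
    for name, (start, end, labels) in codebook.items():
        raw = record[start - 1:end]  # convert 1-indexed columns to a slice
        if labels is not None:
            decoded[name] = labels.get(raw, raw)  # keep unknown codes as-is
        else:
            decoded[name] = int(raw)
    return decoded


raw_line = "3267"  # on its own, just four digits
print(decode_record(raw_line, CODEBOOK))
```

Without the codebook, the string "3267" could be read in many ways (one four-digit number, four one-digit codes, and so on); only the metadata fixes which columns belong to which variable and what each code means.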
When data are lost, so is a part of the historical, social, geographical, and cultural context of Canada at that point in time. This context may well be extremely valuable at some point in the future, to the understanding of the development of this country, and to the planning of future strategies and policy.
Data preservation activities in Canada are hampered by a lack of resources, both of personnel and hardware and software, as well as the lack of rigorous application of standards and quality control to the processes needed to ensure long-term preservation.
II.1.3. The National Archives of Canada
The National Archives of Canada Act mandates the National Archives (NAC) to preserve and facilitate access to 'private and public records', including those in computer-readable form. Although this mandate would seem broad enough to allow for the NAC to function as a national data archive, the NAC's scheduling criteria, and lack of resources and expertise in this area have the effect of excluding research data from their collections.
'Records' are construed by NAC to be unpublished, or not for sale. 'Records' that are available for sale are considered to be published, and therefore deemed to fall outside the mandate of the NAC. Due to the cost-recovery policies of government agencies (both provincial and federal), most research data products are available for a price, and therefore deemed not to fall in the purview of the NAC but in that of the National Library of Canada.
Further, the appraisal principles (i.e. selection criteria) outlined by NAC for the scheduling of records to be transferred to the NAC focus on national policy-making processes, i.e. the information products which underpin the formulation of policy (such as reports generated from data files) rather than the research data from which the reports are produced. Ian Wilson, the National Archivist, admitted at the General Stakeholders Meeting on October 2, 2000, that the NAC "concentrates on the record rather than the data set".
As a direct result of the reorganization of the NAC in 1987, which resulted in the dissolution of the MRA, the highly trained and specialized personnel of the MRA were scattered among a number of NAC divisions. The personnel of the MRA were at that time the largest single collection of persons specialized in the management and preservation of research data in Canada. This unit constituted the largest 'think-tank' on data management, preservation, and access issues in Canada, and they produced some excellent work. Those personnel who remain have been doing other work for about 15 years, and their skills are over a decade behind. The lone archivist valiantly attempting to cope with what remains of the collection amassed by the MRA lacks the resources to compensate for the MRA dissolution.
II.1.4 The National Library
The National Library Act mandates the National Library to preserve the published heritage of the nation. For example, book publishers are enjoined to send two copies of every book published to the National Library.
Late in 1995, the National Library began an Electronic publications pilot project (EPPP), as described in the Final report produced in June of 1996 (<http://www.nlc-bnc.ca/pubs/abs/eppp/e-report.pdf>), followed by a later Networked electronic publications policy and guidelines document produced in October of 1998 (<http://www.nlc-bnc.ca/pubs/irm/eneppg.htm>). The thrust of the project is to preserve the Canadian published heritage, including publications in a computer-readable format. Criteria for inclusion in the project include "document[s] which have the characteristics of traditional publications such as undergoing formal preparation activities traditionally associated with the publishing process". Clearly, the NLC's project focuses on information products rather than on data products, as discussed in the introduction. It is noteworthy that the NLC has concluded that:
Experience to this time indicates that preservation of electronic publications cannot be left to electronic publishers, just as preservation of publications in other media is not left to publishers. Long-term preservation means ensuring that a publication survives long after copyright has expired and any archiving activity of the copyright holder or publisher has ceased. By acquiring an electronic publication from the originator as soon as it becomes published, the NLC is assured of preserving the integrity of a publication as originally released. The Library is also able to verify and ensure that the electronic publication is in a form that is readable by standard software and therefore accessible for current and future generations of readers and researchers.
(Source: <http://www.nlc-bnc.ca/pubs/irm/eneppg.htm>)
The criteria for the management and preservation of published electronic documents are very different from those required for the preservation of research data files. Nonetheless, this does not reduce the labour required to ensure their long-term preservation.
The National Library is very aware that access does not equal preservation. Further, since access frequently implies provision of a product in a software-dependent format for optimum use or efficient location, and since software-dependent formats have a very short shelf life, it follows that access is indeed the antithesis of preservation.
The National Library does not collect research data files from government, even though, as discussed above, these are priced and therefore deemed to be published. They are currently deemed to be outside the purview of the National Library's collections, since they have not undergone the formal preparation activities associated with traditional publications. The resources and expertise required for the management and preservation of research data files are very different from those required for the management and preservation of the traditional publications in electronic form covered by the NLC project.
Thus neither the National Archives nor the National Library has either the infrastructure or a clear mandate to act as a national data archive.
II.1.5 Academic data services and data archives
Beginning as early as the mid-1960s, two Canadian universities established data service/archive centres, namely Carleton University and York University, and a few additional facilities were established in the 1970s. Not until the 1980s, however, was there any significant growth in the number of data service facilities. At the present time, almost all major Canadian universities have a data service of some sort. However, very few have the resources or the skills to act also as data archives. Those that do act as archives have no coordinated sets of standards for data or metadata and no union catalogue, and their practices and services differ widely from institution to institution.
The data service and data archive facilities currently in universities in Canada are normally mandated to provide services to their own academic institutions or on a collaborative basis to a small group of local academic institutions. None of these services has the mandate or the resources to effectively provide data management and preservation at a national level.
They are uniformly under-resourced as to budget, personnel, space, and computing infrastructure. Working in relative isolation, with at best annual meetings with colleagues, they tend toward spontaneous, duplicative mini-projects (e.g. six individual data-extractor interfaces developed in Canada alone) rather than toward the type of large coordinated projects that achieve results. The minimal resources of these local academic data archives quickly become swamped during certain seasons of the academic year.
Unfortunately, except for the brief period from approximately 1974 through 1986, these were and continue at present to be the only archives available for academic and commercial sector data, as well as for the bulk of the government sector data produced in Canada.
II.1.6 International ramifications
The lack of a national data archive has additional ramifications at the international level.
Because Canada has no national data archive, or equivalent, Canada has missed out on participation in international data sharing agreements. This not only limits the capacity for research in Canada, but also means Canada has no voice in how international arrangements are formulated and managed.
For example, 16 countries, including Germany, Australia, and the United Kingdom, have national memberships in the Inter-University Consortium for Political and Social Research (ICPSR), which ensures that all their academic institutions have access to ICPSR data. Canada, by contrast, is treated as a part of the United States and has two regional memberships comprising a total of about 30 member institutions; the remaining Canadian academic institutions have no access to ICPSR data other than by direct purchase.
The NESSTAR project (<http://www.nesstar.org/>) is an international project spearheaded by three national data archives in Europe, with additional partners in academia and the commercial sector. The objective of the project is to develop a data extractor and data analyzer interface to simplify the analysis of research data for the non-specialist user. With a national data archive, Canada might be at the table, so as to influence the development of the interface in a direction that is more useable in Canada (it is now tailored primarily to the hardware/software environments prevalent in Europe). Without the appropriate resources to bring to the table, Canada will have no influence over
II.2 Policy gaps
II.2.1 National information policy
Canada currently lacks a national information policy, and has only an incomplete national information infrastructure.
National information policy in Canada is currently only promulgated by Treasury Board. Treasury Board's mandate is to set information policy for the federal government only. However, since the bulk of research data collected in Canada (especially in the social sciences) is collected by federal government agencies, the policies of Treasury Board de facto become, at least as they affect information and research data, the information policies of the nation.
Unfortunately, since there is no mechanism for public consultation, the policies set by Treasury Board in no way reflect the needs of the non-government sectors. Nor has an information policy been on the platform of any political party in a federal or provincial election in the past half-century.
The lack of a national information policy is clearly beyond the scope of the present Consultation; however, it is a fundamental policy gap which is at the root of all other current infrastructure gaps discussed here.
II.2.2 Government cost recovery policy
The cost recovery policies of government create the illusion that computer files are like traditional publications (books, periodicals, cd-roms, etc.) and can be, and therefore are, replicated almost infinitely. In reality, the 'publishing' of computer files lacks all the overt and covert infrastructure of traditional published media, such as libraries, copyright deposit requirements, book stores, antiquarian book dealers and used book stores, etc., which contributes to the preservation of books and periodicals.
It is also important to note that government cost recovery policies are applied indiscriminately, to information products as well as to data products, and to data-related services such as custom tabulations. In this context it is interesting to note that, according to an informal communication, Statistics Canada derives an annual income of approximately $10 million from the sale of data and information products. Of that, approximately $4 million is from the sale of traditional publications, $1 million is from the sale of access to a major time-series database of socio-economic information, over $4.77 million is from the production of custom information products, and a mere $0.23 million is from the sale of research data products. Thus, a change in the policy regarding the pricing of research data products, to increase their accessibility outside the academic sector, would likely have a minimal impact on Statistics Canada's revenues.
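The proportions implied by these figures can be checked with a few lines of arithmetic. The following is a minimal sketch only: the dollar amounts are the approximate, informally communicated figures quoted above, and the category labels and the function name share_percent are illustrative names introduced here, not Statistics Canada terminology.

```python
# Approximate annual revenue in millions of dollars, as quoted in the text.
# The category labels are illustrative shorthand, not official terms.
revenue = {
    "traditional publications": 4.00,
    "time-series database access": 1.00,
    "custom information products": 4.77,
    "research data products": 0.23,
}

total = sum(revenue.values())


def share_percent(category):
    """Return a category's share of total revenue, as a percentage."""
    return 100 * revenue[category] / total


for category in revenue:
    print(f"{category}: {share_percent(category):.1f}% of total")
```

On these figures, research data products account for about 2.3 percent of the total, which is the basis for the claim that a change in their pricing would have a minimal effect on revenues.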
II.2.3 Conflicting research council policies
SSHRC is the only national research funding agency currently to require deposit of research data collected with SSHRC funding (<http://www.sshrc.ca/english/programinfo/policies/repositories.htm>):
SSHRC requires that data collected with its assistance, including machine-readable files or computer databases, become public property and be made available for use by others within a reasonable period of time, on condition that confidentiality of information and right to privacy are protected. Consequently, it also requires that the institution of the principal investigator or any other institution which becomes the repository of the data, take the necessary steps to preserve the data and facilitate its accessibility to researchers.
Unfortunately, this policy is not enforced, with the result that very little council-funded data is ever actually deposited in existing Canadian data archives.
On the other hand, the Tri-Council Policy Statement: Ethical conduct for research involving humans (<http://www.nserc.ca/programs/ethics/english/index.htm>) is interpreted as requiring the destruction of data on human subjects, and restricting its availability for secondary analysis, while the policy itself clearly states that it is the personally identifiable information (name, SIN, etc.) which must be very carefully restricted or destroyed, but that anonymized data can be made available for secondary analysis. Therefore there is a need to make the Councils and researchers aware of the serious ramifications of the Tri-Council policy, as well as a need to provide training on anonymization standards and techniques to ensure that valuable data are not mistakenly destroyed.
II.2.4 Federal government policy regarding contract data
Federal government policy regarding contract data is also problematic. Prior to 1991, policy regarding contract data seems to have been determined on a department-by-department basis, i.e. some departments demanded the research data file as part of the standard contract deliverables, while other departments did not. In 1991, however, Treasury Board produced a new policy on Ownership of intellectual property in government contracts, which states:
When reviewing intellectual property aspects in preparation for the award of a contract involving R&D, departments are to start with the presumption that contractors will take title to intellectual property.
This policy effectively states that if, for example, Canadian Facts conducts a survey for a government department, Canadian Facts retains the intellectual property rights to the data collected. This attitude to the data product, as opposed to the report, is unfortunately in line with other government priorities, which assign little or no value to the raw data, but do ascribe value to the written reports based on the data.
II.3 Culture gaps
II.3.1 Definition of culture gaps for purpose of this submission
Infrastructures consist not only of institutions and of written policies, but also of something more intangible that is often described as 'common practice' or simply 'the way it's done'. Culture gaps include the more intangible long-term effects of a change in government policy from one of 'government information is a public good' to 'government information is a commodity', a fundamental change in government philosophy that took place between 1984 and 1986. If the federal government regards information as a commodity, then, since all other sectors are heavy consumers of government information, they of necessity begin to regard their information as a commodity as well.
II.3.2 Lack of research culture in most disciplines that considers future replicability of the research as part of the design process.
In 1976 the Canada Council published a report, Survey research: report of the Consultative Group on Survey Research. A substantial portion of that opus recommended the adoption of standard classifications for demographic variables, to improve the comparative analysis potential of data in Canada. Clearly the recommended coding schemas would need to be updated, but the idea is sound: if all research data collections code key variables in the same way, then the capability for comparative research is greatly enhanced.
However, this culture of enhancing replicability and comparative research seems to be missing in much of Canadian data collection today, with the exception of the Canadian election surveys series and some of the public use microdata files produced by Statistics Canada from, for example, the quinquennial census of population, the survey of consumer finances, the labour force survey, etc. Although Statistics Canada is addressing the problems of data comparability (e.g. through its Integrated Metadata Base initiative), it is only one of the many data producing agencies that need to be involved in such an effort.
With a culture that encourages replicability come long-term benefits in terms of increased capacity for comparative research.
II.3.3 Lack of tradition and training for data preservation beginning at the graduate research training stage.
While graduate students often learn statistical analysis by doing secondary analysis of previously collected data, they are not encouraged or required to actively participate in that culture of data sharing in terms of their own data. Universities should require that original research data collected in the process of graduate research be deposited in the same way that graduate theses and dissertations are required to be deposited as a condition of the degree-granting process. Such a requirement would encourage researchers, from an early stage, to regard their original research data, as well as their reputations as researchers, as something to be enhanced by the sharing of data, rather than as something that is in some way diminished by sharing.
Cases such as that of Dr. Roger Poisson (Correspondence, New England journal of medicine 330(20):1458-1462, May 19, 1994 <http://www.nejm.org/content/1994/0330/0020/1458.asp>) reflect inadequate training in research ethics. Ideally, all empirical research should expect as a matter of course that the data used in analysis be widely available for independent replication and re-analysis.
II.3.4. Research data are undervalued.
The importance of previously-collected data is traditionally undervalued, both by producers and by researchers, especially in disciplines which have little experience with analyzing change over time, and in disciplines which rely on experimental techniques, e.g. psychology.
Due to current policy and economic climates, there is no norm or culture of automatic data deposit. A national data archive would become a force for promoting the culture of archiving data among researchers.
However, inter-disciplinary research, and comparative research, are increasing in importance, as is the concomitant demand for data from a variety of times and/or spaces. Most data that have been collected in the past can never be replicated, and once lost, our ability to compare the present moment in time with those past is lost permanently.
II.3.5 Are modest changes to existing institutional policies and mechanisms adequate to meet current and future requirements?
So, in response to the question posed in Part 1, CAPDU's answer is simply 'no'. Any changes to existing policies and mechanisms which might be adequate to meet even current requirements must be wide-ranging, and incorporate substantial resources.
It is our submission that the infrastructure required to provide long-term preservation of significant research data in Canada requires:
- an institutional mandate that includes archiving of data collected by government, academia, and commercial sectors,
- an institutional mandate to disseminate or provide access to research data in its collections to ensure that they are most efficiently used for secondary analysis,
- an institutional mandate that recognizes the current and future potential for cross-disciplinary research encompassing many kinds and sources of data,
- an institutional mandate that complements those of the National Archives and National Library so as to maximize the preservation of research and information resources in Canada,
- an institutional mandate to lobby government at all levels, and the academic sector for needed revisions to legislation as well as policy,
- an institutional mandate empowering the national data archive to negotiate for Canada in data-related international policy, standards, and data exchange activities and initiatives,
- the resources to provide the level of activity needed to preserve research data (in all disciplines) in perpetuity.
Part III: Who will benefit from the improved management of research data and to what degree? How will effective research data management, preservation and access contribute to Canadian research capacity?
These two questions are in reality merely two aspects of the same question, and therefore are in this submission dealt with jointly.
III.1. Who benefits? And how would a national data archive contribute to better research capacity?
Data producers, from the development of reasonable standards vis-a-vis formats, storage media, metadata, authenticity standards, access standards and infrastructures, etc., and from known policies, practices, and venues for preservation, which will release the producer from the burden of having to consider whether or not to preserve the data and, if they deem the data should be archived, of having to provide the necessary infrastructure. Also through efficiencies in data usage and the development of a trained cadre of knowledge workers.
Research funding bodies at all levels from the development of commonly accepted standards and criteria. Also from improved capabilities for comparative research, inter-disciplinary research, methodological advancements, etc.
Academic researchers in the emergence of a culture that values research data as a valuable raw material, and not as effluent after the research process is completed. This leads naturally to an appreciation of the effort involved in creating clean, well-documented data, and the academic rewards that accrue thereto. Also from access to better data and broadly enhanced capabilities for comparative research. And from a 'cleaner' reputation.
Researchers inside and outside the academic sector in improved capabilities for research, both longitudinal and cross-sectional, as well as better trained employees able to correctly interpret in-house and external data and realize opportunities, suggest policies, etc. Thus, a more vibrant, and more efficient economy. Enhanced opportunities for the information economy.
Students from access to better data, enhanced capabilities for research, and better trained and more knowledgeable faculty.
Policy-makers from better training, better analyses of data, greater insights into social and economic conditions from better comparative research.
And the public from more pertinent policies, and better research for ultimately a smaller expenditure, improved Canadian competitiveness through international collaboration and increased knowledge generation.
Acknowledgements:
Contributors to this submission include:
Richard Boily, Université du Québec à Rimouski,
Susan Czarnocki, McGill University,
Suzette Giles, Ryerson Polytechnic University,
Elizabeth Hamilton, University of New Brunswick,
Susan Jackson, Carleton University,
Anastassia Khouri, McGill University,
Jeffrey Moon, Queen's University,
Sharon Neary, University of Calgary,
Walter Piovesan, Simon Fraser University,
Author of the mistakes: Laine Ruus, University of Toronto