RD-Connect’s primary objectives are to develop:

  • an integrated platform to host and analyze genomic and clinical data from research projects;

  • clinical bioinformatics tools for analysis and integration of molecular and clinical data to discover new disease genes, pathways, and therapeutic targets;

  • common infrastructures and data elements for rare disease patient registries;

  • common standards and a catalogue for rare disease biobanks; and

  • best ethical practices and a proposal for a regulatory framework for linking medical and personal data related to rare disease.

RD-Connect will initially incorporate data generated by two associated projects: EURenOmics, which uses multiple approaches to focus on causes, diagnostics, biomarkers, and disease models for rare kidney disorders such as steroid-resistant nephrotic syndrome and tubulopathies; and NeurOmics, which uses novel molecular approaches to improve diagnosis and treatment of rare neurodegenerative and neuromuscular disorders such as Huntington’s disease and muscular dystrophies. RD-Connect unites existing infrastructures and integrates the latest tools to create a comprehensive platform for biobanking, data analysis, and patient registry for researchers across the world.

THE RARE DISEASE ENVIRONMENT AND THE IMPACT OF NEW TECHNOLOGIES

Because 80% of rare diseases are thought to have a genetic component, particular emphasis has been placed on the prospects that rapidly developing technologies such as genomics, transcriptomics, metabolomics, and proteomics offer for rare disease research.1,2 These technologies offer new paths to identify novel disease and modifier genes, delineate biomarkers, and identify therapeutic targets. However, concrete achievements from these personalized and stratified medicine approaches in rare disease have been limited, particularly when it comes to translation to therapies and the clinic.3,4

The integration of the outputs of these new technologies with detailed clinical phenotype data and the combination of data across centers and across diseases is crucial for further progress. While such integrative efforts are ongoing within some medical centers, individual efforts often remain largely siloed. This is a critical problem in rare disease studies, where a given center may see only a small number of patients with a certain disease. Linking such data sets across centers and across diseases is thus an essential step. Outside the rare disease field, a number of major research infrastructures, including the International Cancer Genome Consortium and the International Human Epigenome Consortium, have shown the feasibility of robust tools for large-scale data and sample sharing across multiple research projects.5

To address this issue, major medical research funders have come together in a global effort to foster collaboration in rare disease research. The International Rare Diseases Research Consortium (IRDiRC) was launched in 2011 and now has 35 members worldwide, including the European Commission as well as key national funders, such as several institutes in the US National Institutes of Health.6 Each of these funders has pledged to spend a minimum of US$10 million on rare disease research over 5 years. The IRDiRC has set itself two headline goals to achieve by 2020: 1) to develop the means to diagnose most rare diseases and 2) to deliver 200 new therapies for rare diseases.

In this review, we provide an overview of the objectives and initial achievements of one of the first projects to be funded under the IRDiRC. Initiated in 2012, RD-Connect is a €12 million (US$19 million) IRDiRC infrastructure project funded by the European Union’s Seventh Framework Programme.7 The project brings together 27 partner institutions and works in close collaboration with two associated research projects, Neuromics (www.rd-neuromics.eu) and EURenOmics (www.eurenomics.eu) (see Table 1). Its objectives for the 6-year funding period are to develop an integrated platform connecting databases, registries, biobanks, and clinical bioinformatics for rare disease research and to contribute to the IRDiRC goals by facilitating gene discovery, diagnostics, and therapy development.

Table 1 Rare Neuromuscular, Neurodegenerative, and Renal Diseases: Initial Contributing Projects

DEVELOPMENT OF A UNIFIED INFORMATICS PLATFORM FOR DATA SHARING AND ANALYSIS

Key data management challenges faced by rare disease research projects include the high complexity and heterogeneity of the data types involved, the variability among experimental platforms,8 the need to incorporate non-standardized sample descriptions and diverse data types, and the requirement to protect data (both pre-publication data and identifiable patient information). Furthermore, the high volume of data and the distributed nature of the sources make traditional approaches to data management impractical, and new solutions are therefore required.

One of the primary goals of RD-Connect is to help contributing projects make their data rapidly available to the wider rare disease research community. Raw genomic data from collaborating projects is first deposited in the European Genome-phenome Archive at the European Bioinformatics Institute, a permanent resource for controlled-access archiving,9 then reprocessed in a standard manner to ensure cross-compatibility. The processed data is held in the central RD-Connect database, where it is combined with phenotypic and biomaterial information. Researchers approved by a data access committee gain secure access to an integrated data portal that enables comparison of data sets across projects; data can be explored and analyzed using a comprehensive suite of bioinformatics tools (Fig. 1).

Figure 1

Overview of the RD-Connect integrated platform. Raw genomic data from collaborating projects (Neuromics, EURenOmics, and other IRDiRC-supported projects) will be securely deposited in the European Genome-phenome Archive before being processed through a standard pipeline to ensure cross-compatibility. The processed data will be held in the central RD-Connect Data Coordination Centre, where it will be combined with other data types plus phenotypic and biomaterial information. Researchers approved by a data access committee will access data through a secure online interface that enables comparison of data sets across projects and analysis with sophisticated bioinformatics tools.

In addition to the technical aspects, any project involving aggregation of data has to contend with challenges relating to the willingness of researchers to share their results. Here, RD-Connect has the advantage that its two associated research projects, Neuromics and EURenOmics (Table 1), have been pilot users of the system. Thus, key policies such as staged embargo periods that temporarily protect results have been developed in close collaboration with these end users. Within the first year, this partnership has borne fruit: the first data sets have already been transferred, and timelines for release of the remainder have been approved. Additional ways of meeting the data-sharing challenge include initiatives such as microattributions10,11 and nanopublications,12 which stimulate and reward the sharing and integration of data and knowledge, with an explicit focus on interoperability rather than massive centralization. Although RD-Connect is intended to act as a central repository for IRDiRC research data, it will also be able to integrate with existing databases operating in the same field, such as the DECIPHER database of genomic variation data based at the UK Sanger Institute.13

Close integration with existing research resources, such as those developed by the European Bioinformatics Institute, with the major European research infrastructures, including the Biobanking and Biomolecular Resources Research Infrastructure14 (BBMRI; http://bbmri.eu) and the European life-sciences infrastructure for biological Information15 (ELIXIR; http://elixir-europe.org), and with important global developments in data sharing, such as the Global Alliance for Genomics and Health16 (GA4GH; http://genomicsandhealth.org), is also crucial in this regard.

DATA ANALYSIS AND PROCESSING: CREATION OF NEW BIOINFORMATICS TOOLS

The multi-dimensional data compiled at the RD-Connect central hub affords an opportunity to analyze genetic etiology, functional molecular profiles, and patient-level phenotypes in relation to rare disease. Appropriate analysis tools are applied to this data with the immediate objective of generating a more complete picture of rare disease causes and mechanisms from the molecular to the physiological level, towards the eventual goal of improved diagnostics and therapy. Thus, RD-Connect will adapt, develop, and apply new bioinformatics tools to this uniquely rich data set. These tools will be integrated into the RD-Connect platform, piloted in analyses in collaboration with investigators from the Neuromics and EURenOmics projects, and made broadly available in open-source release to the wider community for use in other projects.

A major challenge is to adapt existing sources of knowledge and methods of analysis to the scale of whole-exome and whole-genome sequencing or simultaneous characterization of thousands of transcripts and proteins. Linking data from different sources also requires the creation of novel approaches, as does the application of these results to clinical translation such as therapy selection, or to the identification of new therapeutic targets. Although the impact of pathogenic genetic variants on transcripts, proteins, and pathways can often be predicted based on genetic information, it is becoming increasingly clear that disease phenotypes are strongly influenced by additional genetic, epigenetic, and environmental factors. Therefore, large-scale data related to epigenomics, transcriptomics, proteomics, metabolomics, lipidomics, glycomics, phenomics, and secretomics need to be processed and made available to the rare disease field in a similar way to traditional genetic analysis. RD-Connect has begun developing methods to combine these data to facilitate gene and biomarker discovery.

The RD-Connect platform will enable a range of bioinformatics tools to be used on data held within the system: both tools that are being further developed within RD-Connect and the related projects, and external tools that can be linked in through common application programming interfaces (APIs) and web services. These include variant interpretation and pathogenicity prediction systems, variant/phenotypic “matchmaking” tools, and integrative analysis tools using semantic web applications and frameworks, for improved data integration and access to knowledge.17–24

CROSS-LINKING DATA IN DATABASES AND BIORESOURCES

Historically, there has been very limited cross-linking between biomaterial collections, registries, genomics, and trial data, with the exception of individual clinical research centers, where all the information may be held by a single investigator. Unfortunately, the more common situation is that the same patient is associated with multiple entries in different systems, with extensive phenotypic information available in a registry, biosamples available in a biobank, and genomic data in a research database, but without any possibility of linking the data sets. This is clearly suboptimal: at best it results in much duplication of effort, and at worst in missed opportunities for discovery, diagnosis, or treatment. Solving this data-linking problem is made more challenging by the need for strong data protection to ensure patient confidentiality. Furthermore, owing to the rarity of the conditions, rare disease patients are more likely than others to have data in cross-border repositories, and an international solution to this problem is therefore essential.

Other projects facing similar issues have evaluated solutions involving the generation of a “globally unique identifier” (GUID) from a set of personally identifiable information associated with the research participant. This allows data on a single individual to be accumulated across projects over time, regardless of where and when the data was collected, and enables researchers to define a study population using data collected in different centers. In the United States, the National Institutes of Health (NIH)-funded National Database for Autism Research (NDAR)25 has developed an identification system that is now being extended under the auspices of the NIH Office of Rare Diseases Research for use in linking patient clinical information with biospecimens. Similarly, a unique identifier for Huntington’s disease patients taking part in international research studies has been developed.26 RD-Connect intends to implement such a system prospectively on a voluntary basis across participating registries and biobanks, to allow data sets to be cross-linked in full compliance with current data-protection policies. However, to ensure that no data from contributing projects has to be excluded, the GUID will not be a prerequisite for entry of data into the system.
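The idea of deriving an identifier from personally identifiable information can be illustrated with a minimal Python sketch. It is not the NDAR algorithm or the system RD-Connect will adopt; the choice of input fields, the normalization rules, the keyed SHA-256 hash, the "GUID-" prefix, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Remove accents, whitespace, and case so trivial spelling variants match."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(without_accents.split()).upper()


def make_guid(given_name: str, family_name: str, birth_date: str,
              birth_city: str, secret_key: bytes) -> str:
    """Derive a pseudonymous identifier from personal details.

    birth_date is expected as an ISO string (YYYY-MM-DD). The keyed hash is
    one-way, so the identifier cannot be reversed to recover the inputs.
    The field set and key handling are hypothetical.
    """
    fields = [
        normalize(given_name),
        normalize(family_name),
        birth_date.strip(),
        normalize(birth_city),
    ]
    message = "|".join(fields).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return "GUID-" + digest[:20].upper()


# The same person entered with different spelling conventions at a registry
# and at a biobank receives the same identifier, so the records can be linked.
key = b"hypothetical-shared-secret"
registry_id = make_guid("José", "Müller", "1980-02-29", "São Paulo", key)
biobank_id = make_guid(" jose ", "MULLER", "1980-02-29", "Sao  Paulo", key)
other_id = make_guid("Maria", "Müller", "1980-02-29", "São Paulo", key)
```

In this sketch only the derived identifier would leave the contributing center, while the personal details stay local. A production system would also need to handle typing errors, missing fields, and key management, none of which is attempted here.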

GENERATION OF A COMPREHENSIVE, SEARCHABLE, ONLINE CATALOGUE FOR HUMAN RARE DISEASE BIOMATERIALS

Making biological materials and associated data from rare disease patients accessible and available to the scientific community in an internationally coordinated manner is a core aim of the RD-Connect platform. This initiative is based on collaboration between the two major relevant biobanking infrastructures in Europe: BBMRI, which historically has focused primarily on population biobanks, and EuroBioBank.27 EuroBioBank has historically been composed principally of neuromuscular biobanks, but will extend to incorporate rare disease biobanks of all types.

Within the first year, RD-Connect investigators have begun a mapping exercise to ensure outreach to all biobanks holding biomaterials related to rare diseases. Progress has also been made towards the development of a new online interface for a rare disease biomaterial catalogue, including primary cells, tissue, DNA, serum, RNA, and human induced pluripotent stem cell lines. This will also include information related to biobanking standards, including an overview of major existing standards and guidelines for biobanking and “Minimal Information Standards” for sample collections.

The design of existing databases, registries, and biobanks often does not allow information to be shared in a computer-accessible fashion. For this reason, participating resources are being offered the opportunity to make use of linked data and semantic web approaches: computational standards and methods that enable them to make their data more accessible and interoperable.28

USE AND FURTHER DEVELOPMENT OF PHENOTYPE ONTOLOGIES

It is increasingly recognized that advances in sequencing technology do not replace the need for detailed clinical assessment of patients with rare disease. On the contrary, deep phenotyping is more important than ever in order to interpret whole-exome and whole-genome sequencing results. However, where clinical notes are on paper systems in hospitals, or where clinicians enter free text in electronic systems, the power of computation cannot be leveraged to support analysis. Ontologies are structured representations of knowledge using a standardized, controlled vocabulary for data integration, organization, searching, and analysis. To ensure the searchability of the data in the central system, RD-Connect makes use of ontologies of both phenotypic features (signs and symptoms of diseases) and diseases and disease groups (disease classifications or nosologies).
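A short Python sketch can show why ontology-coded data is more searchable than free text: a query for a general term can also retrieve patients annotated with any of its more specific subclasses. The toy ontology, the term identifiers, and the patient annotations below are hypothetical and are not taken from any real ontology release.

```python
from collections import defaultdict

# Toy ontology expressed as child term -> parent terms (subclass relations).
# All identifiers are hypothetical placeholders.
SUBCLASS_OF = {
    "T:0002": ["T:0001"],  # a broad category under the root term
    "T:0003": ["T:0002"],  # a specific feature within that category
    "T:0004": ["T:0002"],  # another specific feature within that category
    "T:0005": ["T:0001"],  # an unrelated category under the root term
}


def descendants(term, subclass_of):
    """Return every term that is a direct or indirect subclass of `term`."""
    children = defaultdict(set)
    for child, parents in subclass_of.items():
        for parent in parents:
            children[parent].add(child)
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def find_patients(query_term, annotations, subclass_of):
    """Return sorted IDs of patients annotated with the term or any subclass."""
    matching_terms = descendants(query_term, subclass_of) | {query_term}
    return sorted(
        patient for patient, terms in annotations.items()
        if terms & matching_terms
    )


PATIENTS = {
    "P1": {"T:0003"},
    "P2": {"T:0005"},
    "P3": {"T:0004", "T:0005"},
}

# Querying the broad category finds P1 and P3 through their specific terms.
hits = find_patients("T:0002", PATIENTS, SUBCLASS_OF)
```

A free-text search for the broad category name would miss both patients unless the clinician had happened to write that exact phrase, which is the limitation the controlled vocabulary is meant to remove.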

With over 10,000 classes (terms) describing human phenotypic abnormalities and over 13,000 subclass relations between the classes, together with extensive annotation and cross-referencing with other ontologies, the Human Phenotype Ontology (HPO)29 is a leading example of a phenotypic ontology. Phenotypic data contributed by the associated research projects, Neuromics and EURenOmics, is being captured using the HPO, enabling standardized cross-cohort comparisons and filtering, as well as implementation of algorithms for automated “matchmaking” to help find cases with clinically similar presentations and variants in the same gene. Collaboration between RD-Connect, Neuromics, and the developers of the PhenoTips30 software has enabled development of user-friendly online forms for clinicians to enter disease-specific phenotypic information for neuromuscular and neurodegenerative diseases using the HPO. As part of the collaboration with HPO developers, terminology workshops with expert clinicians are planned to augment the HPO (still under active expansion) with further phenotypic classes. For rare disease classification, RD-Connect will also use the Orphanet Rare Disease Ontology,31 a nosology system cross-referenced with ICD-10, HPO, and other systems.
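As a rough illustration of automated matchmaking, the Python sketch below pairs cases that share a candidate gene and have overlapping phenotype term sets. It is an assumption-laden simplification, not the algorithm used by RD-Connect or PhenoTips: similarity is a plain Jaccard index over term sets, the 0.5 threshold is arbitrary, the gene names are placeholders, and the HPO-style identifiers are illustrative and should be checked against a current HPO release.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_matches(cases, min_similarity=0.5):
    """Return pairs of cases sharing a candidate gene and similar phenotypes.

    `cases` maps a case ID to {"hpo": set of term IDs, "genes": set of genes}.
    Each match is (case A, case B, shared genes, rounded similarity score).
    """
    matches = []
    for id_a, id_b in combinations(sorted(cases), 2):
        shared_genes = cases[id_a]["genes"] & cases[id_b]["genes"]
        score = jaccard(cases[id_a]["hpo"], cases[id_b]["hpo"])
        if shared_genes and score >= min_similarity:
            matches.append((id_a, id_b, sorted(shared_genes), round(score, 2)))
    return matches


# Hypothetical cases from two cohorts; gene names are placeholders.
CASES = {
    "A": {"hpo": {"HP:0001324", "HP:0003236", "HP:0002650"},
          "genes": {"GENE_X"}},
    "B": {"hpo": {"HP:0001324", "HP:0003236"},
          "genes": {"GENE_X", "GENE_Y"}},
    "C": {"hpo": {"HP:0001250"},
          "genes": {"GENE_X"}},
}

matches = find_matches(CASES)
```

Here cases A and B are paired because they share a candidate gene and two of three phenotype terms, while case C is excluded despite the shared gene because its presentation does not overlap. Published approaches typically weight terms by specificity and use the ontology hierarchy rather than exact term overlap.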

GENERATION OF COMMON DATA ELEMENTS AND STANDARD OPERATING PROCEDURES FOR PATIENT REGISTRIES

To enable standardized aggregation of patient information, in addition to the use of standardized ontologies, it is helpful to establish common data elements for data collection. Work in this area builds on existing protocols and best-practice recommendations produced by the leading database and registry initiatives represented within RD-Connect, including the disease-specific networks established for neuromuscular diseases,32 cystic fibrosis,33 and Huntington’s disease,34 the recommendations of the European Platform for Rare Disease Registries project, and the rare disease common data elements developed and published by the NIH’s Office of Rare Diseases Research,35 all of which will be leveraged to provide an initial framework for participating registries. Best practices for information collection by biomaterial collections have been developed by established biobanks such as EuroBioBank,36 the Telethon Network of Genetic Biobanks,37 and BBMRI, and their further development will also be promoted by connection with the BioMedBridges initiative, whose mission statement includes “Building data bridges and services between biological and medical infrastructures in Europe”, and which comprises the ten biological and medical sciences research infrastructures selected by the European Strategic Forum for Research Infrastructures.
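To make the role of common data elements concrete, the following Python sketch checks a registry record against a small shared definition before aggregation. The element names, types, and allowed values are hypothetical and do not reproduce the data elements published by the NIH Office of Rare Diseases Research or any other body named above.

```python
from datetime import date

# Hypothetical common data elements: name -> (expected type, required?)
COMMON_DATA_ELEMENTS = {
    "patient_id": (str, True),
    "date_of_birth": (date, True),
    "sex": (str, True),
    "diagnosis_code": (str, True),
    "age_at_onset_years": (int, False),
}
ALLOWED_SEX = {"female", "male", "other", "unknown"}


def validate_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, (expected_type, required) in COMMON_DATA_ELEMENTS.items():
        if name not in record or record[name] is None:
            if required:
                problems.append(f"missing required element: {name}")
            continue
        if not isinstance(record[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    for name in record:
        if name not in COMMON_DATA_ELEMENTS:
            problems.append(f"unknown element: {name}")
    if isinstance(record.get("sex"), str) and record["sex"] not in ALLOWED_SEX:
        problems.append("sex is not one of the allowed values")
    return problems


good_record = {
    "patient_id": "R-001",
    "date_of_birth": date(2001, 5, 17),
    "sex": "female",
    "diagnosis_code": "placeholder-code",
}
bad_record = {"patient_id": "R-002", "sex": "F", "hair": "brown"}

good_problems = validate_record(good_record)
bad_problems = validate_record(bad_record)
```

Records that pass the same checks at every participating registry can be pooled without per-site recoding, which is the practical benefit of agreeing on the elements in advance.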

DEVELOPING SOLUTIONS TO ETHICAL AND REGULATORY ISSUES

Ethical issues surrounding sharing of sensitive personal data have to be dealt with robustly in a project of this nature. Data-protection and ethics-approval mechanisms must be taken seriously, but solutions must not excessively hinder research. In rare diseases, this is a particular issue when differences in national procedures can pose significant barriers. For example, in a recent trial for juvenile dermatomyositis, the participation of 103 clinical centers was needed to recruit 130 patients, and the ethics approval process took 2 years.

Working closely with regulators at the European and global levels, RD-Connect is developing recommendations to overcome such hurdles, including a risk-based approach to ethical review38 that simplifies the process for information-based research. The risks associated with research that uses genetic information, stored biospecimens, or information from databases, medical registries, patient records, and questionnaires are not physical, but informational, e.g., related to unauthorized release of information. In such cases, a more expedient review process is suggested, paving the way for a simplified procedure for data-sharing research. On a practical level, a rare disease data-sharing charter and standardized templates for informed consent procedures are being developed. Patient issues and stakeholder inclusion are also recognized as central, and a patient-centered approach ensures patient views are taken into account across all aspects of the work. This is coordinated by a 16-member Patient Advisory Council made up of patient representatives from associated projects and patient organizations.

CHALLENGES AND ISSUES TO BE ADDRESSED

A number of challenges facing the RD-Connect project are summarized in Table 2. While some of the technical and scientific challenges are specific to this project, others such as economic, ethical, societal, regulatory, and political issues apply to rare disease research in general.

Table 2 Challenges and Issues to Overcome while Developing the RD-Connect Infrastructure

CONCLUSION

Patient registries,39 biobanks,40 and bioinformatics support are key infrastructure tools required for genomic research in rare disease; data sharing and the linking of patients, samples, and analyses are also essential. The infrastructure developed by RD-Connect supports research in rare disease to find new genes, biomarkers, and therapeutic targets more quickly and efficiently. Its ultimate goal is to improve outcomes for rare disease patients via major improvements in diagnostics and therapeutics (Fig. 2). The therapeutics market in rare disease has strong growth potential due to the high (and unmet) medical need for most rare diseases. Genomic research and development will thus be highly relevant for many markets, including genetic testing, biomarkers, and therapeutics. The 2011 Orphanet report on rare disease research41 noted that networking initiatives resulting in easy and secure access to resources (databases and registries, biobanks, reference data sets, and analysis tools) and a close working relationship with patient groups are clear predictors of success for translational efforts in rare disease. As the field evolves and embraces the opportunities of the new technologies, there will be further challenges relating to access to sufficient patient numbers and to sufficient high-quality biological samples that are annotated with harmonized ontologies, associated with detailed molecular data, and analyzed by standardized analysis pipelines. RD-Connect will enable the elucidation of pathways relevant across rare diseases and identify shared therapeutic targets for groups of rare and common disorders. Ultimately, it will enable cross-linking and efficient distribution of quality-controlled data to the rare disease research community within a secure ethical and legal framework, which is crucial for achieving the IRDiRC goals.

Figure 2

Impact of RD-Connect