Abstract

To enable rational vaccine design, studies of molecular and cellular mechanisms of immune recognition need to be linked with clinical studies in humans. A major challenge in conducting such translational research studies lies in the management and integration of large amounts and various types of data collected from multiple sources. For this purpose, we have established “IMMUNOCAT”, an interactive data management system for the epitope discovery research projects conducted by our group. The system provides functions to store, query, and analyze clinical and experimental data, enabling efficient, systematic, and integrative data management. We demonstrate how IMMUNOCAT is utilized in a large-scale research contract that aims to identify epitopes in common allergens recognized by T cells from human donors, in order to facilitate the rational design of allergy vaccines. At clinical sites, demographic information and disease history of each enrolled donor are captured, followed by results of an allergen skin test and blood draw. At the laboratory site, T cells derived from blood samples are tested for reactivity against a panel of peptides derived from common human allergens. IMMUNOCAT stores results from these T cell assays along with MHC:peptide binding data, results from RAST tests for antibody titers in donor serum, and the respective donor HLA typing results. Through this system, we are able to perform queries and integrated analyses of the various types of data. This provides a case study for the use of bioinformatics and information management techniques to track and analyze data produced in a translational research study aimed at epitope identification.

1. Introduction

A crucial step for rational subunit vaccine design is the selection of antigens to include. For vaccines against infectious agents, antigens capable of inducing protective immune responses are desired. Several strategies based on genomic and proteomic approaches are being used to identify subsets of antigens that are highly expressed in general [1], on the surface [2], or during infection [3]. Antigens from these priority subsets are then followed up individually to test for their capacity to induce protective immunity. An alternative strategy that identifies protective antigens directly is to map targets of immune responses in previously infected hosts that successfully cleared the infection. This strategy is applicable whenever past infections are known to provide protective immunity. In those cases, the capacity of antigens to induce protective immunity in a vaccine setting has been shown to correlate with the magnitude of the response against that antigen post infection [4]. Therefore, knowledge of targets of immune responses in infected hosts has high value for vaccine design against infectious diseases. Knowing immune response targets is also crucial for the development of allergy vaccines, whose goal is to modulate the pathologic immune responses of allergic individuals towards those found in non-allergics [5, 6]. Similarly, for cancer vaccines to be successful, it is necessary to identify antigens targeted by immune responses associated with tumor regression [7, 8]. In summary, identifying the targets and characteristics of immune responses in well-characterized host populations enables the rational design of vaccines.

One established approach for the identification and characterization of T cell immune responses is the use of peptide-based epitope mapping strategies. These are especially efficient when used in combination with bioinformatics predictions of candidate peptides [9]. The identification of epitopes, the exact molecular units of recognition within an antigen, also provides a mechanistic understanding of cross-reactivity of immune responses for different pathogens. This has recently been applied to study T cell immunity to swine flu [10, 11], and is important when designing cross-protective vaccines.

We have participated in two recently completed large-scale T cell epitope mapping projects, one to characterize epitopes responsible for the protective immunity conveyed by the smallpox vaccine [12-15], another to characterize epitopes in Arenaviruses [16-18], which has led to the generation of a candidate for a cross-protective vaccine (M. Kotturi et al., PLoS Pathogens, in press). One lesson learned from these studies is that their data management is challenging, as the epitope response patterns discovered are typically complicated [19]. Also, these studies require the integration of large amounts and various types of data collected from multiple clinical and laboratory sites. Like many other groups, we have managed these data in a collection of spreadsheets, lab notebooks, and database systems designed for a single type of experiment. While each of these provides a sufficient mechanism to capture a specific type of information, the integrated analysis of these data often becomes labor intensive. Worse, problems due to inconsistencies in nomenclature and incompleteness of datasets are often only discovered at the time of analysis rather than at the time of data entry, which can make it hard or impossible to rectify them.

One way to address these issues is to collect data, from the start, in an integrated database system which immediately connects data from different sources. While this requires more work upon data entry, compared to less formal means of data capture such as spreadsheets, it greatly reduces the effort for data analysis. A deciding factor for our group to move toward an integrated database system was the award of a large-scale NIH contract that aims to identify T cell epitopes from common allergens with the ultimate goal of facilitating the development of improved allergy vaccines. This study involves tracking donors enrolled at two clinical sites, capturing complex clinical history information, and storing results from multiple experiments performed in-house and through external providers (see Figure 1 for a breakdown of the overall study). Our experience with previous, less complex studies suggested that the maintenance and analysis of these data would be challenging. We therefore decided to establish IMMUNOCAT, which aims to store all data relevant for an integrated analysis of epitope mapping experiments, with the ultimate goal of enabling rational vaccine design.

IMMUNOCAT integrates and enhances existing information management systems in our laboratory. Two stand-alone web-based relational databases were previously developed. ELICAT stores and analyzes results from ELISPOT [20] experiments used to test T cells for their cytokine secretion profile in response to peptide antigens. TOPCAT was designed to store the binding affinities determined in competitive peptide:MHC binding assays [21] utilizing a TopCount instrument. The databases are linked by the use of common peptide identifiers, which are assigned to every peptide that is synthesized in our studies. For the allergen epitope mapping study, we needed to link the results of T cell reactivities stored in ELICAT with information about the donor from whom the T cells were derived. We first built a new web-based database called DONORDB to store answers from a clinical questionnaire, skin test results, and records of blood samples. Subsequently, this donor database was integrated with the existing ELICAT and TOPCAT databases (Figure 2) through the use of shared MHC allele and donation identifiers. The result is IMMUNOCAT, an integration of three web-based databases.

In the following, we describe the features and components of IMMUNOCAT, which has been in stable use in our laboratory for over a year. With this system, we are able to manage all data relevant to our epitope mapping studies in a central location and ensure their quality and integrity. Answers to integrated queries that would previously have taken significant effort can now be determined instantly, such as how many blood samples are available from donors that showed T cell reactivity to a certain peptide. In addition to the allergy epitope mapping study described here, IMMUNOCAT has been modified for use in two recently initiated epitope mapping studies for dengue virus and Mycobacterium tuberculosis. This demonstrates the benefits of applying bioinformatics and information management techniques to epitope mapping studies in order to facilitate rational vaccine development.

2. User Administration

IMMUNOCAT provides different functionality to different groups of users. Each user has an individual account created by a database administrator. In order to access the database through the web, users have to identify themselves by supplying a user name and password. Different functionalities are provided based on the user group assignments. For example, there are two primary groups of users for DONORDB. The first group represents staff members at the clinical sites, who enter information from the clinical questionnaire and results from the skin test and blood draw. The second group is staff members at the laboratory site, who track the use of blood samples for each donor. Different users within each group are assigned different levels of access. For example, lab scientist supervisors have the ability to audit, add, delete, or modify the information in the system as necessary. Other lab scientists can only enter raw data and must request lab scientist supervisors to correct, for example, data entry errors. Separate projects using the DONORDB, such as the epitope mapping studies for Dengue Virus, use different web addresses for user access.
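The group-based access levels described above can be illustrated with a minimal Python sketch. This is not the IMMUNOCAT implementation: the role names, the `Action` enumeration, and the `is_allowed` helper are hypothetical and only mirror the two laboratory roles named in the text.

```python
from enum import Enum


class Action(Enum):
    AUDIT = "audit"
    ADD = "add"
    DELETE = "delete"
    MODIFY = "modify"


# Hypothetical role table mirroring the text: supervisors may audit, add,
# delete, or modify records; other lab scientists may only enter (add) raw data.
PERMISSIONS = {
    "lab_supervisor": {Action.AUDIT, Action.ADD, Action.DELETE, Action.MODIFY},
    "lab_scientist": {Action.ADD},
}


def is_allowed(role: str, action: Action) -> bool:
    """Return True if the given role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())
```

Under this scheme, a lab scientist who needs a data entry error corrected has no MODIFY permission and must ask a supervisor, as described above.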

3. System Components and Features

3.1. DONORDB at Clinical Site

DONORDB is used at the clinical sites to capture information as donors are enrolled, interviewed, and undergo clinical procedures. For our ongoing allergen epitope mapping study, demographic information of the donor, such as gender, birthdate, ethnicity and parents’ birthplaces (as an additional way to identify ethnicity), is first entered. Next, a questionnaire is completed based on donor interviews (Figure 3). The questionnaire is tied to a scoring scheme that is used to classify donors into different disease categories, namely, three classifications of “allergic rhinitis” (none, possible, and probable) and “allergic asthma”. In terms of clinical procedures, the results of a prick skin test for allergic reactions against 32 common allergens are recorded in terms of flare and wheal diameters. Also, the hemoglobin count of a donor is measured and recorded to determine whether the donor can safely donate blood. If so, a separate blood donation visit is scheduled, and the volume of blood drawn is recorded in the database before samples are shipped to the laboratory site for further analysis.

At every step, the system ensures that donors meet all enrollment criteria, by notifying clinical staff if a donor falls under one of the exclusion criteria based on the entered information. During initial enrollment, exclusion would occur if, for example, the patient states that he is currently receiving allergen immunotherapy without yet being in maintenance. After initial enrollment, exclusion would occur, for example, when the hemoglobin count is too low to allow a blood draw, or when the donor is still taking antihistamine medication right before the skin test. Integrating these questions and criteria into the database makes it easier for the clinical sites to keep track of all aspects of the enrollment criteria, and promotes consistency between sites.
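As an illustration of how such exclusion checks can be encoded, consider the following minimal Python sketch. It is not the DONORDB code: the `DonorRecord` fields and the `exclusion_reasons` function are hypothetical, and the hemoglobin threshold is a placeholder, since the actual cutoff used in the study is not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Placeholder value; the study's actual hemoglobin cutoff is not given in the text.
MIN_HEMOGLOBIN_G_DL = 12.5


@dataclass
class DonorRecord:
    on_immunotherapy: bool = False
    in_maintenance: bool = False
    hemoglobin_g_dl: Optional[float] = None  # None until measured
    antihistamine_before_skin_test: bool = False


def exclusion_reasons(donor: DonorRecord) -> List[str]:
    """Return every exclusion criterion the donor currently falls under."""
    reasons = []
    if donor.on_immunotherapy and not donor.in_maintenance:
        reasons.append("receiving allergen immunotherapy without being in maintenance")
    if donor.hemoglobin_g_dl is not None and donor.hemoglobin_g_dl < MIN_HEMOGLOBIN_G_DL:
        reasons.append("hemoglobin count too low for a blood draw")
    if donor.antihistamine_before_skin_test:
        reasons.append("antihistamine taken right before the skin test")
    return reasons
```

Returning a list of reasons rather than a single flag lets the interface tell clinical staff exactly which criterion triggered the notification.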

Only anonymized information about the donors is entered at the clinical sites. As donors are enrolled, they are assigned a donation identifier by the system. This identifier is used to match blood samples shipped to the laboratory site with the information collected at the clinical sites. The clinical sites store the donation identifiers in their general patient records, which allows them to, for example, match repeat donations of a patient pre- and post-immunotherapy, or match existing patient identifiers established at the individual sites to the DONORDB donation identifiers. No personally identifiable information (such as names or social security numbers) is stored in the database, which ensures that all anonymization requirements for data analysis are met.

By default, the data entry process follows the order in which the study was envisioned to be conducted. However, this order can be overridden by users. At any time point, users can save completed data entry forms and log out. They can log back in, identify the donor they were working on based on the donation ID, and continue data entry. If necessary, users can jump to different entry parts at their convenience. This was identified as critical functionality, as in clinical practice the idealized workflow outlined above is often interrupted due to changes in patients’ schedules and the availability of time for data entry. Similarly, data entry requires internet access, which for various reasons may not always be available. We found that enrollment information at clinical sites is therefore often recorded on paper before being transferred into the database.

3.2. DONORDB at Laboratory Site

The lab scientists receive blood samples from the clinical sites, and process them to extract peripheral blood mononuclear cells (PBMCs). They use the database to track the availability of PBMCs from each donation, determine the number of vials that could initially be made from the blood shipment, record the use of vials in subsequent experiments, and track the location of the remaining vials in the freezer.

For each donor, data from two experiments performed at outside commercial laboratories are stored directly in DONORDB. The first set of data consists of IgE antibody titers against a panel of allergens determined in a RAST assay. The second set of data consists of the results from HLA typing assays, which determine the specific set of MHC molecules the donor expresses.

Simple queries can be run against DONORDB in a web-based form (Figure 4). At this point, the system provides query functions to retrieve results from skin tests and the receipt of blood samples, as well as to monitor the number of donors in the two clinical sites. For example, lab scientists can query the database and see whether the blood samples of specific donors have been received at LIAI. They can also identify the donors who were skin test positive for a particular allergen and recruited at a specific clinical site. These queries are used routinely by laboratory scientists to select suitable blood samples for experiments.

3.3. ELICAT Database

ELICAT was designed for the management of data from ELISPOT assays. In these assays [13, 18], T cells are stimulated with peptide antigens, and those that respond by producing cytokines are visualized as individual spots. These recognized peptides are potential candidates for inclusion in an allergy vaccine. The experiments are performed on 96-well plates, and the numbers of spots are counted by an automated ELISPOT reader. The raw data generated is exported to a text file containing a matrix whose elements denote the number of spots detected in each well. This raw data is automatically imported into ELICAT, where it is connected with information about experimental design previously entered through the interface shown in Figure 5. An experiment is defined as one or more ELISPOT plates run with the same layout and cells from the same source. The plate layout specifies which wells are used to hold the tested peptides and which wells are used to hold the controls. The peptides used are specified in a separate file containing peptide identifiers.
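The import step described above can be sketched in Python as follows. This is a minimal illustration and not the ELICAT importer: it assumes the reader exports a whitespace-delimited matrix of 8 rows by 12 columns, and the function names and the layout labels (peptide identifiers or a control label such as "NEG") are hypothetical.

```python
from typing import Dict, List

ROWS = "ABCDEFGH"  # 96-well plate: rows A-H
N_COLS = 12        # columns 1-12


def parse_plate(text: str) -> Dict[str, int]:
    """Parse an 8 x 12 whitespace-delimited matrix into {well: spot_count}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) != len(ROWS):
        raise ValueError(f"expected {len(ROWS)} rows, got {len(lines)}")
    counts: Dict[str, int] = {}
    for row_label, line in zip(ROWS, lines):
        values = line.split()
        if len(values) != N_COLS:
            raise ValueError(f"row {row_label}: expected {N_COLS} values, got {len(values)}")
        for col, value in enumerate(values, start=1):
            counts[f"{row_label}{col}"] = int(value)
    return counts


def group_by_content(counts: Dict[str, int], layout: Dict[str, str]) -> Dict[str, List[int]]:
    """Collect replicate spot counts per layout label (peptide identifier or control)."""
    grouped: Dict[str, List[int]] = {}
    for well, label in layout.items():
        grouped.setdefault(label, []).append(counts[well])
    return grouped
```

Validating the matrix dimensions at import time reflects the general aim of catching inconsistencies at data entry rather than at analysis.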

Based on these inputs, ELICAT calculates summary metrics for the ELISPOT results for each peptide or peptide pool. The number of spot forming cells per million is calculated based on the number of spots detected in a well and the number of effector T cells added. Based on replicate spot counts, a P value is calculated that evaluates whether the spot counts detected for a peptide are significantly higher than those detected in negative control wells that contained no peptides. Finally, a stimulation index, which divides the average spot count for a peptide by the average background value, is calculated.
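Two of these metrics can be written out in a short Python sketch. It is an illustration under stated assumptions, not the ELICAT code: the function names are hypothetical, replicates are averaged before scaling, no background subtraction is applied, and a zero background is treated as an error. The P value is omitted because the statistical test used is not specified here.

```python
from statistics import mean
from typing import Sequence


def sfc_per_million(spot_counts: Sequence[int], cells_per_well: int) -> float:
    """Average replicate spot counts and scale to spot forming cells per 10^6 cells."""
    if cells_per_well <= 0:
        raise ValueError("cells_per_well must be positive")
    return mean(spot_counts) * 1_000_000 / cells_per_well


def stimulation_index(peptide_counts: Sequence[int], background_counts: Sequence[int]) -> float:
    """Average spot count for a peptide divided by the average background count."""
    background = mean(background_counts)
    if background == 0:
        # Assumption: how a zero background is handled is not stated in the text.
        raise ValueError("stimulation index is undefined for zero background")
    return mean(peptide_counts) / background
```

For example, triplicate counts of 20, 30, and 25 spots with 200,000 cells per well correspond to 125 spot forming cells per million; against background counts of 4, 6, and 5, the stimulation index is 5.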

In the allergen epitope mapping study, two sets of ELISPOT experiments are performed to identify peptide epitopes. First, pools of peptides are screened for reactivity with PBMCs from allergic donors. Second, the positive pools are deconvoluted to identify which peptides caused the response. As cutoffs for a positive response, 100 spot forming cells per million and a stimulation index of 2 were used. The ELICAT user interface allows description of each assay, including identification of each individual peptide tested or definition of each peptide component of specific pools.
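The two-stage screen can be summarized in a minimal Python sketch. The cutoff values are taken from the text; everything else is an assumption made for illustration, in particular that both criteria must be met, that the cutoffs are inclusive, and the names `is_positive` and `peptides_to_deconvolute`.

```python
from typing import Dict, List, Tuple

SFC_CUTOFF = 100.0  # spot forming cells per million, from the text
SI_CUTOFF = 2.0     # stimulation index, from the text


def is_positive(sfc: float, si: float) -> bool:
    """Assumption: a response is positive only if both cutoffs are reached."""
    return sfc >= SFC_CUTOFF and si >= SI_CUTOFF


def peptides_to_deconvolute(
    pool_results: Dict[str, Tuple[float, float]],
    pool_members: Dict[str, List[int]],
) -> List[int]:
    """Return the peptide identifiers of all positive pools for individual retesting.

    pool_results maps pool id -> (SFC per million, stimulation index);
    pool_members maps pool id -> peptide identifiers contained in that pool.
    """
    selected: List[int] = []
    for pool_id, (sfc, si) in pool_results.items():
        if is_positive(sfc, si):
            selected.extend(pool_members[pool_id])
    return selected
```

The same `is_positive` check can then be applied to the individual peptides in the second round.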

3.4. TOPCAT Database

TOPCAT was the first database established in the laboratory, and its design was largely replicated in ELICAT. TOPCAT is used to store results from assays that evaluate MHC:peptide binding affinity, a necessary, but not sufficient, requirement for T cell recognition [21]. In our assays [26, 27], the ability of the tested peptides at different concentrations to inhibit the binding of a radiolabeled high affinity ligand is determined. The concentration of the tested peptide at which the number of labeled ligands bound is reduced by 50% is the IC50 value. Under the conditions utilized, measured IC50 values approximate the Kd value of the binding interaction [28]. The IC50 values are calculated based on radioactivity detected in a 96-well plate, as measured by a TopCount NXT microscintillation reader. Information on the tested peptides, MHC alleles, and corresponding MHC:peptide binding affinities is stored and analyzed in the database.
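To make the IC50 definition concrete, the following Python sketch estimates it from a dose-response series. This is one simple approach, log-linear interpolation between the two concentrations that bracket 50% inhibition, and is not necessarily the curve-fitting method used in TOPCAT; the function names and the example values in the note below are hypothetical.

```python
import math
from typing import Optional, Sequence


def percent_inhibition(signal: float, max_signal: float, background: float = 0.0) -> float:
    """Percent reduction of bound labeled ligand relative to the uninhibited signal."""
    return 100.0 * (1.0 - (signal - background) / (max_signal - background))


def ic50_by_interpolation(
    concentrations: Sequence[float], inhibitions: Sequence[float]
) -> Optional[float]:
    """Estimate the concentration giving 50% inhibition.

    Interpolates linearly in log10(concentration) between the two adjacent
    points that bracket 50%. Returns None if 50% is never bracketed.
    """
    points = sorted(zip(concentrations, inhibitions))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        if i_lo <= 50.0 <= i_hi and i_hi > i_lo:
            frac = (50.0 - i_lo) / (i_hi - i_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    return None
```

With inhibitions of 5%, 25%, 75%, and 95% at 1, 10, 100, and 1000 nM, the 50% point lies halfway between 10 and 100 nM on the log scale, giving an IC50 of about 31.6 nM.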

All peptides used at the laboratory site are assigned a peptide identifier in the TOPCAT database, and the tubes containing the peptide are labeled accordingly. This identifies the sequence of the peptide and the protein and organism from which the peptide was derived, as well as the purity of the synthesis. If the same peptide is synthesized multiple times, a new identifier is assigned to each unique synthesis. The ELICAT database uses the same peptide identifiers, and uses them to retrieve information about peptides from the TOPCAT system.

For the allergen epitope mapping study, peptides derived from protein sequences of common human allergens were synthesized. Each is being tested for binding affinity against a panel of 23 human MHC class II alleles. Through the integration of the three databases shown in Figure 2, it now becomes possible to directly link the allergic status of a donor, the peptides derived from the corresponding allergens, and their binding affinity to the MHC alleles present in the donor and the T cell reactivities detected towards these peptides.
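The kind of cross-database link described above can be sketched as a single SQL join. The example below uses Python's built-in sqlite3 module as a stand-in for the production database; the table names, column names, and all rows are invented for illustration and do not reflect the actual IMMUNOCAT schema or study data.

```python
import sqlite3

# Hypothetical, simplified schema: one table per data source.
SCHEMA = """
CREATE TABLE hla_typing (donation_id TEXT, allele TEXT);
CREATE TABLE binding (peptide_id INTEGER, allele TEXT, ic50_nm REAL);
CREATE TABLE elispot (donation_id TEXT, peptide_id INTEGER, sfc_per_million REAL);
"""

# For each T cell response above the cutoff, list the donor's own MHC alleles
# together with the measured binding affinity of the peptide to each allele.
QUERY = """
SELECT e.donation_id, e.peptide_id, h.allele, b.ic50_nm, e.sfc_per_million
FROM elispot AS e
JOIN hla_typing AS h ON h.donation_id = e.donation_id
JOIN binding AS b ON b.peptide_id = e.peptide_id AND b.allele = h.allele
WHERE e.sfc_per_million >= ?
ORDER BY e.donation_id, e.peptide_id, b.ic50_nm
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database filled with made-up demonstration rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO hla_typing VALUES (?, ?)",
        [("D001", "DRB1*01:01"), ("D001", "DRB1*15:01"), ("D002", "DRB1*04:01")],
    )
    conn.executemany(
        "INSERT INTO binding VALUES (?, ?, ?)",
        [(1001, "DRB1*01:01", 25.0), (1001, "DRB1*15:01", 800.0), (1001, "DRB1*04:01", 5000.0)],
    )
    conn.executemany(
        "INSERT INTO elispot VALUES (?, ?, ?)",
        [("D001", 1001, 250.0), ("D002", 1001, 40.0)],
    )
    return conn


def restricting_candidates(conn: sqlite3.Connection, sfc_cutoff: float = 100.0) -> list:
    """Run the integrated query with a parameterized response cutoff."""
    return conn.execute(QUERY, (sfc_cutoff,)).fetchall()
```

The join relies on the same shared keys named in the text: the donation identifier connects donor records to T cell results, and the peptide identifier and MHC allele connect those results to the binding data.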

4. System Design Process and Implementation

ELICAT and TOPCAT have been in routine use in the lab for more than 5 years. For DONORDB, requirements were gathered on what information needs to be captured during donor enrollment, donor interviews, clinical procedures, blood sample shipments, and lab tests. For each step, the information needed to track progress was identified. Based on this, prototype web pages were created and discussed with the clinical collaborators and LIAI lab scientists. Both sites tested these prototypes by working through use scenarios, which led to the identification of additional requirements. These were, primarily, giving the users more flexibility in the order in which data was entered, and the capacity to store additional information. After three iterations, the prototypes were considered complete, and a functional system was implemented. The addition of more fields for new types of studies will require going through the same design process and modifying the database and the web application itself. As we gather experience in the kinds of modifications to expect, we plan to create tools that will make standard extensions easier.

All three databases are relational databases with web-browser-based interfaces and are implemented with SQL Server 2005. The user interface was implemented with ASP for TOPCAT, and ASP.NET 2.0 for DONORDB and ELICAT. All three databases are hosted on a Dual-Processor Quad-Core Intel Xeon Rackmount Server. The database information is stored on a RAID 5 drive, which protects against single hard-disk failure, and is backed up daily through the LIAI network.

5. Usage

At this point, clinical information for 86 donors at the UCSD sites and 71 donors at NJMRC has been deposited into DONORDB. Immunogenicity and MHC binding data for more than 60,000 peptides are stored in ELICAT and TOPCAT. We have not, so far, encountered data loss during operation. The source code of IMMUNOCAT is available for download at http://donor.liai.org/Donor_Source.zip. Users will need an installation of the SQL Server 2005 DBMS and the ASP.NET 2.0 framework, and the ability to customize and configure the database for their own purposes. The code will be maintained for at least the next four years, and updates will be made available as new versions of the systems are completed. Prospective users are highly encouraged to contact the authors with any installation problems. For studies that require extensive modifications of the code and for laboratories operating incompatible IT environments, it will be preferable to redesign the application from scratch. In those cases, the present manuscript should still be useful in identifying requirements and reusing appropriate design patterns.

6. Future Prospects

We are planning to implement more search and analysis interfaces that take advantage of the integrated data in IMMUNOCAT. Currently, such analyses are being done by manually running SQL queries, which cannot be expected of laboratory technicians.

We are in the process of extending DONORDB to handle donors from three additional epitope-mapping studies, one dealing with donors from Dengue fever endemic regions, another dealing with donors that have tuberculosis, and a third with donors recently vaccinated against smallpox. This will require modifying some of the information captured for each donor, such as the serological typing of patients for the type of dengue viruses they have been exposed to. Other aspects, such as MHC typing and blood sample collection, will remain the same. In the longer term, we are aiming to make the customization of DONORDB for individual studies easier by allowing the addition of fields from predefined modules. This could, for example, mean that different lab tests performed could be chosen from an ontology such as the Ontology for Biomedical Investigations [29], and that vaccine terms could be taken from the Vaccine Ontology [30].

Finally, one of the goals of IMMUNOCAT is to submit data from completed studies into the Immune Epitope Database [31, 32]. Currently, both TOPCAT and ELICAT have export mechanisms that provide XML formatted data that can be imported into the IEDB. These mechanisms need to be integrated and further updated to integrate the data from DONORDB, and provide comprehensive export functionality of an entire study.

Acknowledgments

This work was supported by NIH contracts HHSN272200700048C, HHSN272200900042C, and HHSN272200900044C, and Grant no. 5 T32 AI00749-14. The authors want to thank Carla Oseroff, Ravi Kolla, Carrie Moore, David Broide, Debbie Broide, Rafeul Alam, Linda Bannister, Susanna Burr, and Howard Grey for helpful discussions.