Abstract

Technological advancements are constantly increasing the size and complexity of data resulting from large-scale RNA interference screens. This fact has led biologists to ask complex questions, which the existing, fully automated analyses are often not adequate to answer. We present a concept of 1Click1View (1C1V) as a methodology for interactive analytic software tools. 1C1V can be applied for two-dimensional visualization of image-based screening data sets from High Content Screening (HCS). Through an easy-to-use interface, one-click, one-view concept, and workflow based architecture, visualization method facilitates the linking of image data with numeric data. Such method utilizes state-of-the-art interactive visualization tools optimized for fast visualization of large scale image data sets. We demonstrate our method on an HCS dataset consisting of multiple cell features from two screening assays.

1. Introduction

High Content Screening (HCS) is of growing importance, allowing for efficient large scale experiments in biological research including drug discovery [1, 2]. HCS uses microscopy images of cells and advanced image analysis methods to detect the effects of RNAis or small chemical compounds on a cellular process of interest; in a multivariate approach, multiple cell parameters are collected, allowing for a complex analysis approach. HCS data include information about the library of bioactive molecules, up to several million microscopy images as well as image analysis data with more than 100,000 rows and more than 20 columns (matrix: RNAi sample × cell features). For quantified image processing results, there is a need to generate statistical, quality control and bioinformatics information up to the final hit list. The use of any data analysis tool requires the researcher to appropriately tune these parameters for a specific dataset, in order to avoid arbitrary results in terms of the number of data clusters, size, or density.

Today, HCS experiments are routinely performed under multiple experimental conditions on multiple test samples and with multiple staining channels. For example, in an experiment with two genotypes and two time points, a scientist might be interested in finding genes that are similarly expressed at the first time point in both genotypes but expressed differently at another point in the genotypes. Exploring such data can benefit substantially from interactive visualization tools that bring the problem of data mining and analysis closer to the individual researcher in the field, by allowing real-time visual data manipulation.

Applied to the context of this paper, HCS data analyses belong to information visualization. Shneiderman [3] published the often-cited Visual Information Seeking Mantra: “Overview first, zoom and filter, then details on demand.” Data visualization played an important role already in early reports in cancer microarray studies. For instance, Khan et al. [4] summarized their analysis results in a planar visualization that shows a clear separation of diagnostic cases. VizRank [5] can score the visualizations according to the degree of class separation and can work through projection candidates to find those with the highest scores. Differently to the approach proposed in this paper, information in their visualization could not be traced back to the original genes as the plot was obtained by multidimensional scaling and used features crafted in several data preprocessing steps (feature selection through neural network learning and feature construction by principal component analysis). Their visualization was therefore not a result of an explicit search. McCarthy et al. [6] were the first to show how RadViz algorithm which can be applied to the analysis of class-labeled datasets from biomedicine. They focused on feature subset selection to reduce the number of genes in visualization and feature grouping (clustering anchors of correlated features) and showed that in such an arrangement the visualizations can provide for a clear separation of instances of different class. We present here unique software methodology which not only provides interactive visualization of data but also links quantified image processing results with raw microscopic image data based on KNIME workflow environment [7].

Sharing requires centralized-image databases repositories which require common image data formats. This is particularly challenging for multidimensional microscopy image data, which are acquired from a variety of microscopes. 1Click1View (1C1V) method uses the sophisticated open standard format Bio-Formats that supports different types of screening image data (e.g., Tiff 8bit, 16bit, JPEG, JPEG2000, etc.). A user can easily access basic image processing functions from image viewer.

In order to facilitate data preparation for visual analysis, preprocessing and postprocessing steps should be simple and should not require programming knowledge. We suggest to implement 1C1V in workflow systems which is crucial for enabling users to deal with data preprocessing and post-processing. The concept of workflow is not new, and it has been used by many organizations, over the years, to improve productivity and increase efficiency on data preparation. A workflow system is highly flexible and can accommodate any changes or updates whenever new or modified data and corresponding analytical tools become available. 1C1V is a general-purpose interactive visualization methodology for multiparametric screening assays that is easy to integrate with KNIME [7] workflow system. The basic idea of 1C1V is to provide the data in one visual frame that allows users to gain insight into the data and generate hypotheses by directly interacting with image data. The advantage of 1C1V is that users are directly involved in the image processing results to combine the flexibility, creativity, and general knowledge of the scientist with the enormous amount of numerical rows connected to image files. 1C1V is developed to help scientists and answer complex queries through interactive visual exploration of screening datasets. One key attribute of the 1C1V that distinguishes it from past and current image data exploration methodologies is that the original image data, their image processing results, and metadata (additional information captured by acquisition software about an image, such as the instruments used, camera, acquisition settings, image size, and resolution) are linked together and are available for filtering, clustering in interactive view.

2. Systematic Errors and Quality Control in HCS

HCS has already proven successful as a method to deliver more relevant information simultaneously in one experiment, rather than delivering a single readout in a series of sequential experiments [811]. A prototype scenario might be the series of simultaneously available readouts obtained from a cellular assay. One parameter identifies cells (i.e., membrane dye at first wavelength), another determines the stage of mitotic change (e.g., fragmented and condensed nuclei at a second wavelength), and a third parameter classifies the apoptotic stage using morphological criteria at a third wavelength. Certainly, these analyses can already be performed almost autonomously with very high throughput. HCS operates with samples in microliter volumes that are arranged in two-dimensional plates. A typical HCS plate contains 96 ( ) or 384 ( ) samples. The quality control and normalization procedure in plate-based primary HCS screens can be mainly performed by using interactive visual analysis.

Quality of measurements has a number of advantages, including objectivity, reproducibility, and ease of comparison across screens. Random and systematic errors can cause a misinterpretation of candidates to be a hit. They can induce either underestimation (false negatives) or overestimation (false positives) of measured parameters. Various methods dealing with quality control are available in the scientific literature. These methods are discussed in details in the papers by Brideau et al. [12], Heyse [13], and Zhang et al. [14, 15]. There are various sources of systematic errors:(i)Systematic errors caused by ageing, reagents evaporation or decay of cells can be recognized as smooth trends in the plate’s means/medians;(ii)Errors in liquid handling and malfunction of pipettes can also generate localized deviations of expected data values;(iii)Variation in incubation time, time drift in measuring different wells or different plates, and reader effects may be recognized as smooth attenuations of measurement over an assay.

Brideau et al. [12] demonstrated examples of systematic signal variations that are present in all plates of an assay. For instance, Brideau et al. [12] illustrates a systematic error caused by the positional effect of detector. To guarantee this reliability avoiding systematic errors, data quality control at different levels is a must. This begins in the optimization phase of the assay: In test runs with a small number of compound plates, the assay has to possess a sufficient signal window (e.g., Z-factor [15]), stability, and sensitivity (e.g., measured by the effects of known control compounds) [16, 17]. If problems occur, the parameters of the assay or even its format should be tuned to match the quality criteria of HCS. Data quality control on the level of an individual assay seeks again to guarantee assay stability and sensitivity, which must be monitored constantly using the appropriate controls. At the same time, it tries to pick up on process artifacts caused by failures in the screening machinery or the test system (e.g., a blocked pipettor needle, air bubbles in the system, or a changing metabolic state of reporter cells) [13].

If unnoticed, these artifacts can result in a high number of false positives, but seemingly “highly specific hits”. Often, such process artifacts can be detected by changes in the overall signal or by specific “signal patterns” on plates (e.g., pipettor line patterns), if the compound library is randomized across the screening plates. Visual analysis of QC with link to images is preferably done directly after the screening run to ensure that such patterns can be traced back to their origin (e.g., the pipettor may be inspected the next morning) and can be unambiguously classified as artifacts—or nonartifacts. This distinguishes false positives from real actives that should be more or less randomly dispersed when considering a whole series of plates from reasonably randomized compound collections.

The hypothesis underlying HCS data analysis is that the measured image descriptors for each single siRNA represent its relative number of observed objects at fluorescence image. Within-plate reference controls are typically used for these purposes (Figure 1). Controls help to identify plate-to-plate variability and establish assay background levels.

3. Interactive Image Data Exploration in Workflow Environment

High dimensionality of datasets makes it difficult to find interesting patterns. To cope with high dimensionality, low dimensional projections of the original dataset are generated; human perceptual skills are more effective in 2D or 3D displays. Mechanisms have been developed to enable researchers to select low dimensional projections, but most are static or do not allow users to make one view frame and crosslink what is interesting to them. Software applications based on those mechanisms require from users many independent clicks which in consequence change the content of main view panel (Figures 2(a) and 2(b)). There is, therefore, a need for a mechanism that allows users to choose the property of projections that they are interested in, rapidly examine projections, make filtering, and locate interesting view panels to detect patterns, clusters, or outliers.

In 1C1V, data can be visualized at several stages of analysis and linked to raw image data keeping concept of star model (Figures 2(c) and 2(d)): one-click, one-view. Unmodified and transformed datasets can be plotted interactively as scatter plots, displayed in histograms, viewed in image viewer/editor or viewed as tables. Entire experiments can be displayed in various overview plots in the context of how they are annotated, and figures and tables can be exported for publication. Data can also be exported for custom analyses (e.g., for algorithms that are very expensive to computer power and time) and local development of new analysis methods and in various defined formats for use in external postanalysis applications.

Data can be visualized at several stages of analysis. Unmodified images and extracted results datasets can be plotted interactively as scatter plots, displayed in histograms, or viewed as tables (Figure 3). An entire result table can be displayed in various overview plots, and tables can be exported for publication. From any data-analysis step, the experiment can be imported into a data-visualization interface in in which the data can be browsed and viewed.

Tools based on 1C1V method should provide a number of projections which interact with image data, filters, tables, and colors (Figure 3). First step to be done after loading the data (images) and extracting image features is configuring GUI, that is, deciding which variable is associated with which axis. This configuration is done in real-time and can be changed easily. Once the axes are set and the data is mapped to the unit cube, the user can start to explore and identify interesting objects by performing the following operations.(1)Create 2D scatter plots of any two experiments by orienting the third axis perpendicularly to the display plane.(2)Select “interesting” objects (e.g., genes) and filter out uninteresting objects (e.g., genes) using a combination of selection tools.(3)Assign color and shape to selected objector groups of genes.

Nauru offers four different ways to select objects. They are as follows.(1)Mouse Click: a gene can be selected by a mouse click. Annotation for the selected gene and its parameter values for all experiments are shown.(2)Selection Plane: the selection plane is a 2D rectangle. A user can move and resize the rectangle. All genes that lie within the rectangle after projection on to the 2D screen are selected. A user can browse through the list of annotation and expression value information of selected genes.(3)Range Selection: a range of differential cell parameters values can be specified for each of the axes and color. All genes falling outside the specified range are not shown. This selection mechanism is useful for setting cut-off values at which the differential expression can be considered significant.

Visualizations are the key to analyzing data in 1C1V. A variety of visualization types can be used to provide the best view of the data:

3.1. Image Viewer (IV)

The 1C1V image viewer (Figure 4) as a key element for HCS data allows users to visualize and edit images and image processing results. IV is highly integrated with interactive table (e.g., sample annotation), plots, data visualization features, and data controls. Image Viewer displays images very quickly, and these images may be viewed in full screen, as slideshows or as thumbnails. It is quite capable of processing images; user can rotate, adjust brightness and color, and apply filters or LUT tables for each independent staining channel. The editor of image viewer has a variety of tools like a Fire, 3-3-2 RGB, Grays, Ice, Spectrum, and export figures for publication and great effects that can be applied on selected image or entire image collection. Moreover, thanks to Bio-Formats plugin, it reads many formats, including Tiff: (8, 16, and 32 bits), JPEG, and JPEG2000. A very special feature of image viewer is a raw data security system. All image processing steps are saved as independent feature mask in settings file. Raw images stay unchanged. Batch processing options (apply the same image processing steps for all images) can be carried out on more than one file at a time by clicking “Save”. It offers nearly instantaneous hotkey zooming. It also allows having several running multiple instances of image panel if you like to browse in different windows.

IV also recognizes image metadata data and optionally displays it alongside the images. Metadata panel can display the basic image metadata like microscope features or exchangeable image file format (exif) tags. Another outstanding plus of IV is the transformation of image metadata into an interactive table (described below). Metadata Panel can also decode CR2, CRW, and exif files and extract the embedded headers. Exif data are embedded within the image file itself. Exif is a standard that specifies the formats for images, ancillary tags used by digital cameras, scanners and other systems handling image, and sound files recorded by digital cameras. The specification uses the following existing file formats with the addition of specific metadata tags: JPEG DCT for compressed image files, TIFF Rev. 6.0 (RGB or YCbCr) for uncompressed image files, and RIFF WAV for audio files (Linear PCM or ITU-T G.711 μ-Law PCM for uncompressed audio data, and IMA-ADPCM for compressed audio data). It is not supported in JPEG 2000, PNG, or GIF. The exif format has also standard tags for location information. Currently, few cameras and a growing number of mobile phones have a built-in GPS receiver that stores the location information in the exif header when the picture is taken.

3.2. Scatter Plots

Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose. Scatter plots show how much one variable is affected by another. Each record (or row) in the dataset is represented by a marker whose position depends on its values corresponding to the and axes.

The above picture demonstrates how scatter plots can be used. Say, for example, that a user wants to show whether there exists a consistency in cell number across all the screening samples. A third variable can be set to correspond to the color or size of the markers, thus adding yet another dimension to the plot. Two-dimensional scatter plots are the default visualization of many datasets.

3.3. Bar Charts

A bar chart is a way of summarizing a set of categorical data. It displays the data using a number of bars of the same width, each of which represents a particular category. The length of each bar is proportional to the count, sum, or the average of the values in the category it represents, such as age group or geographical location. In Nauru, it is also possible to color or split each bar into another categorical column in the data, which enables you to see the contribution from different categories to each bar in the bar chart.

3.4. Interactive Tables

The table visualization presents the data as a table of rows and columns. The table can handle the same number of rows and columns as any other visualization in Nauru. In the table, a row represents a record. By clicking on a row, you make that record active, and by holding down the mouse button and dragging the pointer over several rows, you can mark them. You can sort the rows in the table according to different columns by clicking on the column headers or filter out unwanted records by using the query devices. Different types of visualizations can be shown simultaneously. They are linked and are updated dynamically when the query devices are manipulated (see below). Visualizations can be made to reflect high-dimensional data by letting values control visual attributes such as size, color, shape, rotation, and text labels. You can sort the vertical order of the rows in the table. This can be done in several steps, for example: first sort, according to the values in column 1, then by the values in column 5, then by the values in column 3, and so forth.

3.5. The Filter Panel

Filter panels are used to filter data. Filter panel devices appear by demand in several forms, and scientist can easily select a type of query device that best suits user’s needs (e.g., combo boxes, sliders, etc.). When manipulating a filter by moving a slider or selecting a multiple box, all visualizations are immediately updated to reflect the new selection of image data. In such cases, researchers want to quickly find only RNAi constructs, genes, or compounds similar to the expected pattern.

4. Workflow Platform for Preprocessing and Postprocessing

The concept of workflow is not new, having been used by many organizations, over several years, to improve productivity and increase efficiency. A workflow system is highly flexible and can accommodate any changes or updates as when new or modified data and corresponding analytical tools become available. Raw metadata are not always in ready format for visualization. A workflow environment allows biologists themselves to prepare data for visualization without involving any programming. Workflow systems are different from programming scripts and macros in one important respect. Programming systems and macros use text-basedlanguages to create lines of code, while applications like Nauru use a graphical programming language. In general, workflow systems concentrate on creation of abstract process workflows to which data can be applied when the design process is complete. In contrast, workflow systems in the life science domain are often based on a dataflow model, due to the data-centric and data-driven nature of many scientific analyses. A comprehensive understanding of biological phenomena can be achieved only through the integration of all available biological information and different data analysis tools and applications. A workflow environment allows HCS researchers themselves to perform the integration without involving any programming. As such, the workflow system allows the construction of complex in silico experiments in the form of workflows and data pipelines. Data pipelining is a relatively simple concept. Any computational component or node has data inputs and data outputs. Data pipelining views these nodes as being connected together by “pipes” through which data flow (Figure 5).

Nauru builds a flow by dragging and dropping nodes from the node repository into the main panel and connecting them. Nodes are the basic processing units of a workflow (Figure 5). Each node has input and/or output ports. Data are transported through connections from these node output ports into connected input ports. After positioning the nodes, the inputs of each node are fully connected to outputs of a predecessor node. This is achieved by clicking on an output port and dragging the connection to the input port that should receive data from this output. All data flowing between nodes are wrapped within a class called DataTable, which holds metainformation concerning the column headers and the actual data (e.g., numeric data, image name, image path, image processing parameters, and gene annotations). The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key), a specific number of DataCell objects, image name, and image path. A workflow system is highly flexible and is designed to accommodate any changes on table before interactive visualization.

5. Comparison of Existing Visualization Methods in HCS

We compared existing solutions for visualizing of HCS data and made a summary and comparison of few methodologies and solutions in Table 1.

6. Case Studies

Visualization of image processing features extracted from raw data helps to discover systematic plate-to-plate variation, making measurements comparable across plates. Systematic errors decrease the validity of results by either over- or underestimating true values. These biases can affect all measurements equally or can depend on factors such as well as location, liquid dispensing, and signal intensity. Although recent improvements in automation can minimize bias, and thereby provide more reproducible results, equipment malfunctions can nonetheless introduce systematic errors, which must be corrected at the data processing and analysis stages. For illustration, two screens have been taken into visual quality control analysis using Nauru software described in Table 2. For each cell of the samples, the image processing software (Advanced Cell Classifier [18]) quantified the images by calculating more than a hundred parameters, which mainly fall into the following categories:(i)geometric properties such as the area, perimeter, and shape of the cell nucleus; the location of a cell; average distance of a cell to its neighbours;(ii)intensity information such as the content of a protein, as reflected by the intensity of the corresponding fluorescent dye, and the variance, skewness, and kurtosis of the intensity distribution.

As a result of the analysis of screens 1 and 2, we were able to detect significant errors: (i)plate-to-plate variation (Figure 6(a)),(ii)problematic plates (Figure 6(a)),(iii)edge effect on row A and P of 384 well plate (Figures 6(a) and 6(b)),(iv)signal intensity reduction over plates (Figure 6(b)),(v)batch/sub-batch consistency (Figure 6(b)),(vi)hit-cell analogy assessment (Figure 7(a)),(vii)mistake on dispensing-cell seeding (Figure 7(b)),(viii)cell seeder and seeding distribution over 384 well plates (Figure 7(c)).

7. Conclusion

In this paper, we offer interactive methodology 1C1V for the analyses of HCS datasets. High dimensionality of the datasets hinders users from recognizing important patterns in the datasets. We add a scatter plot browsing mechanism that helps users select interesting 2D projections of the high dimensional multiparametric dataset. In addition, an even more difficult problem is to understand the biological significance of the patterns found in the datasets. Considering the needs of working professionally in the field of HCS data analysis, 1C1V is an effective and efficient method for interactive image data exploration, detection of systematic errors, and quality control. We have demonstrated through simple examples how quickly a researcher could investigate problems of a dataset before making a final decision.

This work is a part of our continuing effort to give users more controls over data mining processes and to enable more interactions with analysis results through interactive information visualization techniques. These efforts are designed to help users perform exploratory data analysis, establish meaningful hypotheses, and verify results. In this paper, we show how those visualization methods can help molecular biologists analyze and understand multidimensional RNAi cell-based screening data. Currently for licensing reasons a tool Nauru developed based on 1C1V methodology is not available as open source software. At the beginning of year 2013, the open source version of 1C1V will be available as KNIME node [7].

Conflict of Interests

There is no conflict of interests to be declared.