Abstract

Demand for high-resolution, low-power sensing devices with integrated image-processing capabilities, especially compression capability, is increasing. CMOS technology enables the integration of image sensing and image processing, making it possible to improve overall system performance. This paper reviews the current state of the art in CMOS image sensors featuring on-chip image compression. Firstly, typical sensing systems consisting of separate image-capturing and image-compression processing units are reviewed, followed by systems that integrate focal-plane compression. The paper also provides a thorough review of a new design paradigm, in which image compression is performed during the image-capture phase prior to storage, referred to as compressive acquisition. High-performance sensor systems reported in recent years are also introduced. Performance analysis and comparison of the reported designs using the different design paradigms are presented at the end.

1. Introduction

Image sensors are found in a wide variety of applications, such as biomedical microsystems, mobile devices, personal computers, and video cameras [1]. As image resolution and frame rate keep increasing, image-processing capabilities are becoming an important consideration for both still-image and video devices. Compression is one of the most demanding processing steps. Still-image compression is achieved by removing spatial redundancy, while in video devices, temporal redundancy can be exploited to further improve compression performance. Various image-compression algorithms and coding schemes have been proposed, such as predictive coding, Discrete Cosine Transform- (DCT-) based compression algorithms [2-4], and wavelet-based compression algorithms [5-8]. International committees publish standard coding schemes for both still images and video streams, such as the Joint Photographic Experts Group (JPEG) standard series [9, 10], the H.26x standard series [11] published by the International Telecommunications Union (ITU), and the Moving Picture Experts Group (MPEG) standards [12] published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC).

VLSI circuits implementing image/video compression standards were reported in the early literature [13-15]. Improvements in CMOS technology have made the CMOS image sensor competitive with the CCD image sensor in many applications [16, 17]; correspondingly, the market share of CMOS image sensors has grown rapidly since the late 1990s. CMOS technology enables the integration of image sensing and image processing, making the CMOS image sensor an optimum solution for improving overall system performance. In the last several decades, image sensors integrating different on-chip compression algorithms have been developed, such as predictive coding [18], wavelet-based image processing [19], DCT-based image processing [20], conditional replenishment [21], the SPIHT algorithm [22], and FBAR and QTD processing [23]. The simplest way to design such a system is to implement the different functions as separate circuit units, such as an image-sensing unit and an image-compression processing unit. Another option is to implement image compression within the sensor array, which is referred to as focal-plane compression. If an image-sensor system integrates focal-plane compression, image processing is performed near the focal plane as soon as the pixel values are captured; higher image quality and a higher processing rate can be expected in such a system. Recently, a new design paradigm referred to as compressive acquisition [24] has been proposed, carrying out image compression during the image-capture phase and thereby moving the image-processing phase ahead of the storage phase. It combines image capture with compression, enabling a lower on-chip storage requirement.

Some widely used image-compression algorithms are first reviewed in Section 2. Section 3 compares the different paradigms used in the development of CMOS image sensors that integrate on-chip image-compression algorithms. These design paradigms include image-sensor arrays with an off-array image-compression processor, image-sensor arrays with focal-plane compression, and compressive-acquisition image-sensor arrays performing compression during the capture phase. At the end of Section 3, some high-performance systems reported in recent years are also introduced, followed by performance analysis and comparison of the reviewed designs. Section 4 concludes the paper.

2. Standard Image/Video Compression

2.1. Still Image-Compression Algorithms

Image compression reduces the amount of data required to represent a digital image by removing data redundancy. Image-compression algorithms can be classified as lossy or lossless.

In lossy image compression, there is a tradeoff between the compression ratio and the reconstructed image quality. If the distortion caused by compression can be tolerated, the gain in compression ratio becomes very significant. Lossy image-compression algorithms can operate in either the spatial domain or a transform domain (such as the frequency domain). A classical lossy image-compression scheme is lossy predictive coding, in which the prediction error is quantized using a limited number of bits. There are different effective predictors, such as the gradient-adjusted predictor (GAP) and the median adaptive predictor (MED). Another way to compress the image is to first map it into a set of transform coefficients using a linear, reversible transform, such as the Fourier transform, the discrete cosine transform (DCT) [2-4], or a wavelet transform [5-8]. The resulting set of transform coefficients is then quantized and encoded.
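To make the lossy predictive scheme concrete, the sketch below implements a minimal closed-loop DPCM coder in Python: each pixel is predicted from the previously reconstructed neighbor and only the coarsely quantized prediction error is kept. The left-neighbor predictor and the uniform quantizer step are illustrative choices for this review, not taken from any cited design.

```python
import numpy as np

def dpcm_encode(row, step=8):
    """Lossy predictive coding of one image row: quantize the prediction
    error against the previously *reconstructed* pixel (closed loop)."""
    codes = np.empty(len(row), dtype=np.int32)
    recon = 0                                   # predictor state
    for i, x in enumerate(row.astype(np.int32)):
        err = x - recon                         # prediction error
        codes[i] = int(np.round(err / step))    # coarse quantization
        recon = int(np.clip(recon + codes[i] * step, 0, 255))
    return codes

def dpcm_decode(codes, step=8):
    out, recon = np.empty(len(codes), dtype=np.int32), 0
    for i, q in enumerate(codes):
        recon = int(np.clip(recon + int(q) * step, 0, 255))
        out[i] = recon
    return out

row = np.array([100, 102, 105, 180, 181, 183], dtype=np.uint8)
print(dpcm_decode(dpcm_encode(row)))  # close to the input, not identical
```

Encoding against the reconstructed value rather than the original keeps the encoder and decoder predictor states identical, so quantization error does not accumulate along the row.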

Lossless image-compression algorithms provide error-free compression and are widely used for medical applications, satellite imagery, business documentation, and radiography, where any information loss is undesirable or prohibited. Typically, compressing an image consists of removing coding redundancy, interpixel redundancy, or both. The simplest way to compress an image is to reduce the coding redundancy using variable-length coding schemes, which map source symbols to a variable number of bits. Huffman coding [25, 26] and arithmetic coding [27, 28] are well-known variable-length coding strategies. An effective way to reduce interpixel redundancy is to use bit-plane coding schemes: an image is first decomposed into a series of binary images, which are then compressed by a binary compression algorithm, such as a run-length coding scheme or a contour-tracing and coding scheme. Another way to remove interpixel redundancy is the Lempel-Ziv-Welch (LZW) coding scheme [29-31], which replaces character strings with single codes without requiring any prior knowledge of the probability of occurrence of the symbols.
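The following sketch illustrates the bit-plane approach described above: an 8-bit image is decomposed into binary planes, and each plane is run-length encoded. It is a simplified illustration (row-major runs, no entropy coding of the run lengths), not a reproduction of any cited coder.

```python
import numpy as np

def bitplane_runlength(img):
    """Decompose an 8-bit image into 8 binary planes and run-length
    encode each plane in row-major order: one (bit, length) run list
    per plane."""
    flat = img.astype(np.uint8).ravel()
    planes = []
    for b in range(8):
        bits = (flat >> b) & 1
        runs, current, count = [], int(bits[0]), 1
        for v in bits[1:]:
            if int(v) == current:
                count += 1
            else:
                runs.append((current, count))
                current, count = int(v), 1
        runs.append((current, count))
        planes.append(runs)
    return planes

img = np.array([[12, 12, 12, 200],
                [12, 12, 200, 200]], dtype=np.uint8)
for b, runs in enumerate(bitplane_runlength(img)):
    print(f"plane {b}: {runs}")
```

The high-order planes of natural images tend to contain long uniform runs, which is why bit-plane coding is effective at reducing interpixel redundancy.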

To ensure that compressed images can be decoded correctly, standardization of image-compression algorithms is essential. The Joint Photographic Experts Group committee proposed the DCT-based JPEG standard [9, 10] and the wavelet-based image-compression standard JPEG 2000 [32-36], in 1992 and 2000, respectively.

2.2. Video-Coding Schemes

Research in video-coding compression can be traced back to the early 1980s. From that time on, the Consultative Committee for International Telegraph and Telephone (CCITT) and the International Organization for Standardization (ISO) began to standardize the various video-coding schemes. Later, H.26x [37-40] and MPEG-1 [41]/MPEG-2 [42]/MPEG-4 [43-45] were published and are used in various video devices and applications. A more recently reported video-coding scheme, so-called distributed video coding (DVC) [46-48], is based on the Slepian-Wolf [49] and Wyner-Ziv [50] theorems of information theory, published in the 1970s. As DVC provides better compression efficiency than the widely used standards, much research has focused on this area in recent years.

3. On-Chip Image Compression for CMOS Image Sensors

3.1. Image Sensors with Off-Array Image Compression

In the 1990s, taking advantage of the development of CMOS technology, CMOS image sensors began attracting the attention of designers. Improvements in manufacturing processes and the use of noise-reduction techniques decreased the noise levels in CMOS image sensors, making them comparable to conventional CCD image sensors. With CMOS image sensors, peripheral image-processing circuits can be easily integrated with the sensor array.

3.1.1. Typical Design Reviews

A typical smart image-sensor system implements the image-capturing device and the image processor as separate functional units: an array of pixel sensors and an off-array processing unit, as illustrated in Figure 1. An Active Pixel Sensor (APS) array is widely used to convert light intensity into analog signals. A standard APS architecture includes a photodiode, a transfer gate, a reset gate, a selection gate, and a source-follower readout transistor. The reset gate resets the photodiode at the beginning of each capture phase, and the source follower isolates the photodiode from the data bus. The analog signals read out from the sensor array carry the raw pixel values used for further image processing. Different designs have been reported in the literature implementing different compression algorithms in the off-array processor.

In [18, 51-54], predictive coding schemes are implemented in the off-array column-level processor. Predictive coding is a very useful image-processing scheme with low computational complexity; it removes correlation between neighboring pixels. The simplest lossless predictive coding scheme, a gradient-adjusted prediction (GAP) integrated with the image-sensor array, was first proposed in [51]: the mean value of the upper and left neighbors is used as the predictive pixel value, and pixels in the first row and first column are not predicted. An analog predictive circuit is used to calculate both the predictive pixel value and the prediction error between the predictive value and the raw value. A more complex version of GAP, using a weighted sum of the upper, left, top-left, and top-right neighboring pixels, is implemented in [52]. In [18, 53], the median adaptive predictor (MED) is implemented in the off-array logic, predicting a pixel value from its left neighbor (denoted a), upper neighbor (denoted b), and top-left neighbor (denoted c) as: the predicted value is min(a, b) if c >= max(a, b), max(a, b) if c <= min(a, b), and a + b - c otherwise. The architecture of the overall system is illustrated in Figure 2. Pixel-level capacitors are used to buffer the raw captured pixel value before further processing. After reading out a pixel, a differential amplifier integrated in the analog-predictor circuit subtracts the photogenerated voltage from the reset voltage so as to reduce pixel-offset fixed-pattern noise. Both the analog raw pixel values and the analog predictive values are converted into digital values in column-level single-slope A/D converters. The digital counters of the single-slope A/D converters are also used to generate Golomb-Rice codes by counting the total number of times the counter overflows its counting range. Testing results show that a compression ratio of around 1.5 can be achieved, which is comparable to the lossless compression standards.
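The sketch below pairs the MED predictor with a Golomb-Rice code, mirroring the predict-then-encode flow of the designs above in plain Python. The Rice parameter k and the zigzag mapping of signed errors are conventional choices, not details taken from [18, 53].

```python
import numpy as np

def med_predict(a, b, c):
    """Median adaptive predictor; a = left, b = upper, c = top-left."""
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c

def golomb_rice(err, k=2):
    """Golomb-Rice codeword for a signed prediction error: zigzag-map
    to an unsigned value, then unary quotient + k-bit remainder."""
    u = 2 * err if err >= 0 else -2 * err - 1
    q, r = u >> k, u & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

img = np.array([[52, 55, 61],
                [59, 60, 63],
                [62, 64, 68]], dtype=int)
bits = []
for i in range(1, img.shape[0]):          # first row/column skipped,
    for j in range(1, img.shape[1]):      # as in the designs above
        pred = med_predict(img[i, j-1], img[i-1, j], img[i-1, j-1])
        bits.append(golomb_rice(int(img[i, j]) - pred))
print(bits)
```

Because MED prediction leaves small, zero-centered errors on smooth images, short Golomb-Rice codewords dominate, which is what yields the roughly 1.5x compression reported above.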

In [55], the authors proposed an integration of image sensing and wavelet-based image processing. The overall sensor array consists of APSs, each of which integrates a photodiode, a reset transistor, and a source follower. Correlated double sampling (CDS) circuits are used in every column to reduce the fixed-pattern noise (FPN), pixel kTC noise, and 1/f noise. Several iterations of the Haar transform are required, as the overall raw image is divided into blocks. Only adder and subtractor processing elements are required in each column-parallel basic Haar-transform unit to complete the Haar transform. Row pixel values read out from the column CDS circuits, together with a copy of those values, are buffered in two parallel capacitors, and the Haar transform is calculated by switches and capacitors. By integrating the image processing, the amount of data required to be transferred during communication is reduced. This performance improvement is significant for high-resolution and high-frame-rate image sensing, as well as for power-constrained and bandwidth-constrained devices.
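A one-level 2D Haar transform of the kind performed by the column circuitry reduces to pairwise sums and differences, as the sketch below shows. The 1/2 normalization is one common convention, chosen here for illustration.

```python
import numpy as np

def haar2d_level(block):
    """One level of the 2D Haar transform: pairwise sums/differences
    along rows, then along columns (averages kept in the first half)."""
    def haar1d(x):
        even, odd = x[..., 0::2], x[..., 1::2]
        return np.concatenate([(even + odd) / 2, (even - odd) / 2], axis=-1)
    rows = haar1d(block.astype(float))   # horizontal pass
    return haar1d(rows.T).T              # vertical pass

block = np.arange(16, dtype=float).reshape(4, 4)
print(haar2d_level(block))               # LL subband in the top-left
```

Since each output is just a sum or difference of two buffered values, adders and subtractors (switches and capacitors in the analog implementation) are indeed the only processing elements needed.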

High-throughput image sensors with off-array image compression have been reported in [56, 57]. In [56], a CMOS image sensor integrated with a 2D Haar wavelet image-compression algorithm was proposed, achieving 1.4 GMACS (Giga Multiply-Accumulate Operations per Second) throughput at SVGA imager resolution, while in [57], 4 GMACS throughput at HDTV 1080i resolution is achieved using an image sensor integrating on-chip DCT processing. Both designs are mixed-signal CMOS image sensors, taking advantage of both analog and digital design techniques, as illustrated in Figure 3. The combination of weighted spatial averaging and oversampling quantization, carried out in a single conversion cycle of the delta-sigma-modulated ADC, enables real-time focal-plane processing, while the digital logic enables high-accuracy digital output.

3.1.2. Efficiency of the Image-Sensor Array Readout

In the early days of CMOS sensor use, designers were apt to treat the image-capture device and the image-processing unit as two independent units connected only by the readout interface, which enables row-by-row scanning of the analog raw pixel values from the sensor array. Later, embedded pixel-level control logic and a corresponding off-array readout circuit became a better option in such system designs, as they can greatly improve the efficiency of the subsequent image-compression processing.

In [58], a CMOS block MAtrix Transform Image architecture (MATIA) is designed as a front end for JPEG compression, as illustrated in Figure 4. A high-swing cascode operational amplifier is used for pixel readout, as well as for current measurement when programming floating gates in the subsequent processing units. The matrix coefficients are stored in an array of floating-gate circuits, each cell of which is a polysilicon gate surrounded by silicon dioxide; the charge on the floating gate can be permanent, as the surrounding silicon dioxide provides a high-quality insulator. A column of current-to-voltage (I-V) converters is connected to the floating-gate array to convert the respective currents to bias voltages for the matrix-vector multiplication. A current-mode differential vector-matrix multiplier (VMM) is used to perform the matrix-vector multiplication, instead of the traditional voltage implementation, because of the VMM's high processing speed, low power consumption, and high linearity. Taking advantage of the programmability of the proposed architecture, two-dimensional (2D) transforms or filter operations on an entire image, or block matrix operations on subimages, can be performed. Various block transforms can be implemented, such as the DCT, the discrete sine transform (DST), and the Haar transform, and the architecture can be extended to different applications, including motion estimation, depth computation from stereo, and spatial or temporal compression and filtering. In [58], the 2D DCT and the Haar transform are carried out in MATIA as examples; both examples use the same elemental block size.
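The programmable transform at the heart of MATIA amounts to the separable block operation Y = T X T^T, where the coefficient matrix T (DCT, DST, Haar, and so on) is what the floating gates store. A minimal numeric sketch, with an orthonormal DCT matrix as the example coefficient set:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II coefficient matrix; in MATIA terms, one of
    the coefficient sets a floating-gate array could hold."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    t[0, :] = 1.0 / np.sqrt(n)
    return t

def block_transform(x, t):
    """Separable 2D block transform Y = T X T^T."""
    return t @ x @ t.T

x = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float)
t = dct_matrix(8)                  # swap in DST/Haar coefficients here
y = block_transform(x, t)          # forward transform
x_back = block_transform(y, t.T)   # inverse: T^T Y T
print(np.allclose(x, x_back))      # True, since T is orthonormal
```

Because T is orthonormal, applying the transposed matrix recovers the original block, which is what makes the same programmable fabric usable for both forward and inverse transforms.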

In one of the most recently published works integrating wavelets in the off-array processor [19, 59], two capacitors are implemented at the pixel level to store both the reset voltage and the integrated photocurrent for subsequent multiple-sampling processing, as illustrated in Figure 5. As nondestructive readout is used in the readout interface, one very interesting feature of the pixel architecture is that both spatial image processing and temporal image processing based on frame-difference calculation are available. The proposed work consists of an APS array. Column-based processing elements perform a block-matrix transform on the readout pixel values, which requires pixelwise signed multiplication and cross-pixel addition carried out in mixed-signal column-based logic circuits. The column-based circuit consists of a sign unit, a binary analog multiplier, an accumulator, and multiplying analog-to-digital converters (MADCs). A switch matrix takes the block-matrix coefficient values and the corresponding sign bits from the sign unit and sends these signals to the binary analog multiplier. The MADCs are used to multiply the readout pixel values with the respective digital coefficients for the convolutional transform. The computational functionality of the sensor array is validated with on-chip Haar discrete wavelet transform- (DWT-) based image compression. The transform results are compared to a threshold before data transmission, and results smaller than the threshold are filtered out.

Reference [60] reported a high-speed (>1000 fps) CMOS image sensor integrated with a DCT processor. In order to achieve a high processing speed, global electronic shutters are implemented with a pixel-level sample-and-hold unit, as illustrated in Figure 6. Raw pixel values are read out and digitized into 10-bit digital signals by parallel ADCs row by row. The 10-bit digital pixel values are buffered and rearranged in an input buffer memory before being sent to the array of image-compression processing elements (ICPEs). 2D DCT, quantization, zigzag scanning, and Huffman coding are carried out in the ICPEs. In the DCT processing, a smaller elemental block size is used instead of the more commonly used DCT block size, so as to reduce the number of inner-product computations. Experimental results show that 3000 fps can be achieved at an operating frequency of 16.8 MHz, and the frame rate can be as high as 8500 fps at an operating frequency of 47.6 MHz. Thus, with the proposed architecture, a 1 M-pixel, 3000 fps digital image sensor can be realized at an operating frequency of 53 MHz.

Compared to one-by-one readout, block-based readout is a more efficient way to prepare raw pixel data for subsequent block-based transforms. In [61], the sensor array is divided into blocks matching the elemental block size of the cosine-coefficient matrix used in the following analog 2D DCT processing unit and the subsequent analog-to-digital converter/quantizer (ADC/Q). Raw pixel values are read out from the sensor array block by block during one readout cycle. There are two stages in the readout phase. In the first stage, a front-end amplifier and a successive fully differential amplifier convert the signal charge into a voltage using a 100 fF capacitor and shift the voltage range to suit the subsequent signal processing; both amplifiers are designed using switched-capacitor techniques. In the second stage, a CDS scheme is carried out to reduce the 1/f noise and offset-voltage deviation. 2D DCT-based compression algorithms are implemented in the off-array processor, as illustrated in Figure 7, to compress the raw captured pixel values [20, 62, 63]. The 2D DCT is performed using an analog 1D DCT processor, which consists of 32 coefficient multiplications, 32 additions, and switch-control logic for weighted summation. Eight rows are read out and processed at a time, and the intermediate results are stored in an analog memory, each cell of which consists of 4 switches and 2 capacitors. A 1D DCT is performed in 2 clock cycles; thus, it takes 32 clock cycles to complete one 2D DCT computation. A 9-bit ADC with a differential nonlinearity (DNL) of less than 0.5 least significant bit (LSB) is used to digitize and quantize the analog 2D DCT results in order to keep a high PSNR (over 40 dB). Variable-length coding is utilized to further remove data redundancy before transmission. The overall sensor array consists of active pixel sensors (APSs). Taking advantage of parallel processing, the operating frequency is only 62 kHz at 30 fps.
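The row-then-column reuse of a single 1D DCT stage with an intermediate memory can be mimicked numerically as below; the 8-point orthonormal DCT is assumed here only because the cited design processes 8 rows at a time.

```python
import numpy as np

def dct1d(v):
    """n-point orthonormal 1D DCT-II of a vector."""
    n = len(v)
    k, i = np.arange(n)[:, None], np.arange(n)[None, :]
    out = np.sqrt(2.0 / n) * (np.cos(np.pi * (2 * i + 1) * k / (2 * n)) @ v)
    out[0] = v.sum() / np.sqrt(n)
    return out

def dct2d_two_pass(block):
    """2D DCT as two 1D passes through the same 1D stage, with an
    intermediate buffer standing in for the analog memory."""
    inter = np.array([dct1d(row) for row in block])      # pass 1: rows
    return np.array([dct1d(col) for col in inter.T]).T   # pass 2: columns

block = np.random.default_rng(1).integers(0, 256, (8, 8)).astype(float)
print(round(dct2d_two_pass(block)[0, 0], 2))  # DC term = 8 * block mean
```

Separability is what makes a single 1D processor sufficient: one hardware stage is time-shared over rows and then over the buffered intermediate columns.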

In [64], a sensor array with pixel-level control units enables three different processing modes during the readout phase: integrated intensity mode (I-mode), spatial contrast mode (C-mode), and temporal difference mode (T-mode). Only 11 transistors are required for temporal-redundancy removal. Pixel-level capacitors are used to buffer raw captured pixel values or the value of a selected neighbor. Winner-takes-all (WTA) and loser-takes-all (LTA) circuits are used to find the brightest and darkest pixel values among selected candidates. In I-mode, raw pixel values are captured, while in C-mode, maximum and minimum differential values between 4 neighboring pixels are calculated online in the off-array column-based circuit. In T-mode, differential values between pixel values in consecutive frames are calculated in real time during the readout phase. All three modes are implemented in the pixel-level processor controlled by pixel-level switches.

3.1.3. High-Performance Compression-Processor Design

In fact, the performance of the off-array processor is one of the key factors that affect the performance of the overall system. High-performance off-array processor designs have been reported in the literature.

In [65], a digital DCT processor is proposed in which a variable threshold voltage is used to reduce the active power consumption, with negligible overhead in speed, standby power consumption, and chip area, for a 2D DCT processing core targeting portable equipment with HDTV-resolution video compression and decompression. Comparing the design reported in [61] with that of [65], the power consumption of [61] is only about half that of [65].

In [66], a low-power, real-time JPEG encoder is presented. Eight different optional output resolutions are available in the proposed system. A buffer is used as an interface between the CMOS image sensor and the encoder, reading out and rearranging raw captured pixel values from the sensor array. The reported JPEG encoder is fully compliant with the JPEG standard, including DCT, quantizer, run-length coding, Huffman coding, and packer units. The DCT processing element consists of 3-level pipelined processing units: raw pixel values are read out in the first level, while arithmetic distribution and DCT-coefficient generation are carried out in the second and third levels, respectively. Experimental results show that 15 fps is achieved at the highest output resolution, while 30 fps is achieved at a lower resolution.

In 2000, Philips National Research Laboratory developed a Transpose Switch Matrix Memory (TSMM) to reduce the power consumption of on-chip image-compression processing requiring block-level communication, by enhancing data-access efficiency and memory utilization. The TSMM is used in a highly parallel single-chip CMOS sensor processor, Xetal [67], to perform JPEG compression at video rate (30 fps) [68]. Xetal is a single-instruction, multiple-data (SIMD) linear processing array for pixel-level image processing. 640 ADCs are used to digitize the analog values. Recursive block DCT is carried out in 320 off-array processing elements (PEs), 80 TSMM units, and an embedded 40-line memory, as illustrated in Figure 8. The TSMM is designed to facilitate block-level interprocessor communication in Xetal, so that DCT or JPEG can be performed on it. It consists of a matrix of registers, using switches to enable access to horizontal and vertical buses. The flexible data-bus control of the TSMM simplifies the implementation of DCT-coefficient transposition and zigzag output scanning.

In [69], a VLSI circuit for wavelet image compression was reported. The proposed architecture contains four major processing elements: (i) a data-format conversion processing element, which converts the raw pixel values from BAYER-RGB format into BAYER-YY format; (ii) a wavelet transform unit, which performs 1D wavelet transforms first on rows and then on columns; (iii) a binary adaptive quantizer, which quantizes the wavelet-transform coefficients; and (iv) a significant-coefficient pyramid coder, which further reduces coding redundancy. Testing results show that at an operating frequency of 25 MHz, a processing rate of 1.5 M pixels/s is achieved. A color image can be compressed within 1 second if faster on-chip memory is available.

3.2. Image Sensors with Focal-Plane Image Compression

The computational complexity of most standard compression algorithms, such as those mentioned in the previous subsection, is still high. As image resolution and frame rate keep increasing, the processing time becomes a bottleneck in high-speed sensor design. In order to increase the processing rate, pixel-level processors are used to implement in-array parallel processing.

3.2.1. APS Array with Focal-Plane Compression Processor

In fact, as early as 1997, imagers integrating parallel in-array processing circuits were reported [21, 70]. In the proposed work, a computational image sensor exploits the parallel nature of the image signal by integrating a conditional replenishment algorithm, a pixel-level video-compression scheme that removes temporal redundancy between consecutive frames. As illustrated in Figure 9, in each pixel the raw captured value is first stored in a capacitor, and a pixel-level differential amplifier compares the newly captured value with the pixel value of the previous frame, which is buffered in another capacitor. Only activated pixels, for which the difference between the newly captured and the previously captured pixel values is larger than a threshold, are read out. Analysis in [21] shows that the higher the frame rate, the lower the percentage of active pixels in the sensor array, since differences caused by motion between frames are much smaller at higher frame rates. The threshold and the frame rate are adjusted in real time so that the total number of activated pixels read out per frame is almost constant. A compression ratio of 5 : 1 to 10 : 1 can be achieved without significant degradation of the video quality under low motion activity. Different methods to encode the activated pixel addresses are compared in [21]. In [71], based on the on-sensor compression architecture, the authors further implemented an off-array coding scheme [70, 71].
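A software model of conditional replenishment takes only a few lines: compare each pixel against its buffered previous-frame value and emit address/value events only for pixels whose change exceeds the threshold. The array size and threshold below are arbitrary illustration values.

```python
import numpy as np

def conditional_replenishment(prev, curr, threshold):
    """Emit (row, col, value) events only for 'activated' pixels whose
    change between consecutive frames exceeds the threshold."""
    diff = np.abs(curr.astype(int) - prev.astype(int))
    active = diff > threshold
    rows, cols = np.nonzero(active)
    events = [(int(r), int(c), int(curr[r, c])) for r, c in zip(rows, cols)]
    return events, active.mean()       # events + active-pixel fraction

rng = np.random.default_rng(0)
prev = rng.integers(0, 200, (64, 64))
curr = prev.copy()
curr[10:20, 10:20] += 30               # a small moving region
events, ratio = conditional_replenishment(prev, curr, threshold=15)
print(len(events), f"active fraction = {ratio:.3f}")
```

A rate-control loop of the kind described in [21] would adjust the threshold frame by frame so that the event count stays roughly constant.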

Image sensors integrating similar pixel-level processing elements to remove temporal redundancy have been widely reported in the literature. In [72, 73], a vision sensor integrates a pixel-level processor that can asynchronously respond to changing events between frames. The pixel architecture combines an active continuous-time logarithmic photosensor and a well-matched self-timed switched-capacitor amplifier. Once the change in illumination intensity exceeds a threshold, a request is sent out, and an off-array address-event representation (AER) processor responds to those requests asynchronously. In [74], a pixel-level processor is embedded into an APS array: two capacitors buffer the pixel values of the newly captured frame and the previous frame, respectively, and the differential value between the two buffered pixel values is compared with a threshold during the readout phase. In [75], a sensor array with QVGA resolution integrating lossless temporal compression has been proposed. Only pixels whose newly captured value differs from the previous one are sampled, reducing power consumption, bandwidth, and memory requirements. A change detector is implemented in pixel-level circuitry to perform the proposed processing.

References [22, 76-80] proposed an APS array integrating a pixel-level parallel prediction image-decomposition algorithm and the Set Partitioning In Hierarchical Trees (SPIHT) coding scheme. The proposed prediction scheme enables image decomposition with lower computational complexity compared with standard wavelet-based algorithms, as detailed in [77]. Within each block, except for the top-left corner, all pixel values are predictive values calculated as a weighted sum of neighboring pixel values. The prediction-decomposition algorithm is performed in parallel in pixel-level circuitry, which enables a higher frame rate, as claimed in [80]. The pixel-level charge-based computational circuit consists of a comparator, capacitors, and control logic (thirteen transistors and four capacitors in each pixel in total), as illustrated in Figure 10. Four different pixels are required for one-level prediction. Capacitors are used to buffer current pixel values for residual computation and to buffer nearby pixel values for the prediction-decomposition processing. For a multilevel image decomposition, only a subset of the raw pixel values needs to be recorded. More details of the charge-based computational circuit can be found in [76, 78]. Computational errors caused by parasitic capacitances were analyzed in [22]: estimated coefficients for nine subbands, obtained from both a gray-gradient test image and test pixels built outside the sensor array, show that the coefficients are consistent across subbands and close to the approximate theoretical values. Another main source of computational error is charge injection, which can be minimized by improving the prediction accuracy. To further improve the performance of the sensor array, CDS is used to reduce the FPN inherent in threshold-voltage mismatch. The linearity of the proposed prediction-decomposition scheme enables the integration of the residual-computation circuits in the signal path of the CDS function.

3.2.2. DPS Array Integrated Compression Processor

Among the different architectures of CMOS image sensors, the digital pixel sensor (DPS) is the most recently proposed [81-85]. It integrates massively parallel A/D conversion and provides a digital readout circuit, enabling easy implementation of parallel processing. A higher processing rate can be expected when using a DPS to acquire raw pixel values for on-chip image-compression processing.

References [23, 86] proposed a DPS array integrated with an on-chip adaptive quantization scheme based on the Fast Boundary Adaptation Rule (FBAR) and a Differential Pulse Code Modulation (DPCM) procedure, followed by online Quadrant Tree Decomposition (QTD) processing. The overall sensor array operates in two separate phases: an integration phase and a readout phase. As illustrated in Figure 11, the integration phase begins with a reset operation, during which the photodiode voltage is pulled up to the reset level and a global counter is set to all ones. After reset, the photodiode voltage discharges proportionally to the illumination intensity while the global counter counts down. Once the photodiode voltage reaches a reference value, the count of the global counter is written into a pixel-level memory as the digitized raw pixel value. During the readout phase, the sensor array is used as a matrix of 8-bit memories. In the off-array logic, QTD compression is performed by building multihierarchical tree layers, each corresponding to a quadrant, using Hilbert scanning. The Hilbert scanning scheme reduces the storage requirement of the adaptive quantizer, as it preserves spatial continuity while scanning the image quadrants. The raw pixel values are quantized by an adaptive quantizer based on FBAR, which minimizes a power-law distortion, the most commonly used distortion measure. One interesting feature of the proposed work is that the FBAR algorithm is performed on the predictive error rather than on the pixel value itself, using DPCM, which results in a compressed dynamic range.
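Hilbert scanning matters here, and in the compressive-acquisition schemes below, because consecutive scan positions are always spatial neighbors. The sketch below uses the standard distance-to-coordinate conversion for a Hilbert curve (a generic algorithm, not code from [23, 86]) to generate such a scan order.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve covering a 2^order x 2^order
    grid to (x, y) coordinates (standard iterative conversion)."""
    x = y = 0
    t, s = d, 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

path = [hilbert_d2xy(3, d) for d in range(64)]   # scan order for 8x8
print(path[:8])   # consecutive scan positions are spatial neighbors
```

Because neighboring scan positions are spatially adjacent, pixel-to-pixel differences along the path stay small, which is exactly what an adaptive quantizer operating on DPCM errors benefits from.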

3.3. Compressive Acquisition Image Sensor

In the designs mentioned so far, the image is first acquired and then compressed: raw pixel values are buffered during the readout phase in off-array storage, or directly buffered in in-array storage, before being compressed, resulting in a very high storage requirement. In [24], the new concept of compressive acquisition has recently been proposed to perform compression online during the image-capture phase, prior to the storage phase. This approach shifts the design paradigm from the conventional Capture-Store-Compress to the newly proposed Capture-Compress-Store, as illustrated in Figure 12. The paper illustrates the potential advantages of such a design paradigm, namely: (i) reduced silicon area required for the DPS, (ii) reduced on-chip storage requirement, and (iii) compression processing integrated within the pixel array, enabling parallel processing. The basic idea behind a compressive-acquisition CMOS image sensor is to combine image-compression processing with image capture, so as to reduce the on-chip storage requirement. There is obviously a tradeoff between the reduction of the storage requirement and the complexity of the pixel-level circuit design. The key question affecting the success of this approach is whether the compression processing can be simplified enough to enable a powerful yet simple pixel design.

Different compression algorithms suitable for integration with a compressive-acquisition CMOS image sensor have been proposed. In [24], a block-based compression algorithm was proposed that can be implemented at the pixel level with very limited hardware complexity, and hence limited overhead in terms of silicon area. The overall array of digital pixel sensors is divided into blocks. Within each block, the brightest raw pixel value is recorded in a block-level memory. The differential value between each pixel and the brightest pixel within the same block is calculated and quantized during the image-capture phase, and only the quantized differential values are recorded instead of 8-bit raw pixel values.
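A behavioral model of this block-based idea is sketched below: the brightest value of each block is kept at full precision, and every other pixel is stored only as a quantized difference from it. The adaptive quantization step used here is an illustrative choice, not the quantizer of the cited design.

```python
import numpy as np

def compress_block(block, qbits=4):
    """Keep the brightest value of a block at full precision; store every
    other pixel as a qbits-wide quantized difference from it."""
    peak = int(block.max())                       # block-level memory
    diff = peak - block.astype(int)               # always >= 0
    step = max(1, -(-int(diff.max() + 1) // (1 << qbits)))  # ceil division
    return peak, step, (diff // step).astype(np.uint8)

def decompress_block(peak, step, codes):
    return np.clip(peak - codes.astype(int) * step, 0, 255)

rng = np.random.default_rng(2)
block = rng.integers(120, 160, (4, 4))
peak, step, codes = compress_block(block)
err = np.abs(decompress_block(peak, step, codes) - block).max()
print(peak, step, int(err))   # reconstruction error bounded by step - 1
```

Storing one full-precision value plus short difference codes per block is what cuts the in-array memory requirement relative to storing 8 bits per pixel.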

In [87], an online 1-bit predictive coding algorithm using a Hilbert scanning scheme, implementable in pixel-level circuitry, is proposed. Compared to a conventional DPS with 8-bit pixel-level memory, the proposed architecture reduces the silicon area by more than half by sampling and storing the differential values between each pixel and its prediction, which feature a compressed dynamic range and hence require limited precision. To further improve the reconstructed image quality, Hilbert scanning is used to read out the pixel values instead of conventional raster scanning. The Hilbert scanning path is implemented entirely by hardwired connections, without increasing the circuit complexity of the sensor array. Reset pixels are inserted into the scanning path to overcome the error-accumulation problem inherent in predictive coding.

In [88], an online compression algorithm is proposed for a compressive-acquisition CMOS image sensor that removes spatial redundancy by sampling selected pixel values instead of all pixel values in the sensor array. The overall sensor array is divided into blocks. In each block, the raw pixel values are reordered after capture, and only the brightest and darkest pixels are sampled and stored in block-level memories. In the image-reconstruction phase, different block models are built based on an analysis of the relationship between neighboring blocks. Taking advantage of the features of the DPS, the online sorting process can be implemented in pixel-level circuitry with very low overhead.

In [89], the concept of the compressive-acquisition CMOS image sensor is extended to the temporal domain. An efficient intra- and interframe compressive-acquisition scheme that removes spatial and temporal redundancy prior to storage is proposed. The overall sensor array is divided into quadrants, and the brightest pixel value within each quadrant is selected as the quadrant value representing that quadrant. A reference array is built to track quadrant-value changes between frames and is updated adaptively by off-array judge logic after each captured frame. The background-versus-nonbackground classification, obtained by comparing the reference array with the matrix of quadrant values, is used to remove data redundancy before transmission.
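The quadrant-based intra/interframe scheme can be modeled as below: each quadrant is represented by its brightest pixel, a reference array tracks those values across frames, and only quadrants classified as nonbackground are flagged for transmission. The quadrant size, threshold, and update rule here are illustrative assumptions, not parameters from [89].

```python
import numpy as np

def quadrant_values(frame, q=8):
    """Brightest pixel of each q x q quadrant represents the quadrant."""
    h, w = frame.shape
    return frame.reshape(h // q, q, w // q, q).max(axis=(1, 3))

def classify_and_update(ref, frame, thresh=10, q=8):
    """Flag quadrants whose value moved more than thresh from the
    reference as nonbackground and refresh the reference for them."""
    qv = quadrant_values(frame, q)
    changed = np.abs(qv.astype(int) - ref.astype(int)) > thresh
    ref[changed] = qv[changed]       # adaptive reference update
    return changed                   # only these need transmission

rng = np.random.default_rng(3)
f0 = rng.integers(0, 200, (64, 64))
ref = quadrant_values(f0)
f1 = f0.copy()
f1[0:8, 0:8] = 255                   # motion confined to one quadrant
print(int(classify_and_update(ref, f1).sum()))   # -> 1 changed quadrant
```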

3.4. Performance Analysis and Comparison

Image sensors with integrated on-chip image compression featuring high resolution, high throughput, high frame rate, or low power consumption have been proposed in recent years. The chip reported in [58], integrating JPEG processing, operates in the subthreshold domain, enabling very low power consumption (80 μW/frame). A 1.4 GMACS (Giga Multiply-Accumulate Operations per Second) throughput at SVGA imager resolution is achieved in [56], while in [57], 4 GMACS throughput at HDTV 1080i resolution is achieved using an image sensor integrating on-chip DCT processing. In [22], a frame rate of 3000 fps is realized with a power consumption of 0.25 mW.

Table 1 compares the features of the above-cited designs. The traditional design paradigm implements the image pixel sensor and the image-compression processor as two separate functional units connected by an off-array scan-based readout interface. Typically, more complex standard compression algorithms can be implemented in an off-array processor, resulting in higher reconstructed image quality; a PSNR higher than 30 dB can easily be achieved, as shown in the table. The different A/D converter architectures used in those designs are compared in Table 2.

However, traditional scan-based readout circuits limit the volume of data that can be read out, while with each new generation of image sensors, the image resolution and frame rate keep increasing, as illustrated in Table 3. CMOS image sensors integrating pixel-level processors have been studied to address this issue [22-24, 70, 73]. Focal-plane compression processing can be implemented in either the analog domain [22, 70, 72] or the digital domain [23, 24]. A higher processing rate can be expected by using a focal-plane processor: in [22], pixel-level charge-based computational circuits perform the prediction-decomposition algorithm at up to 3000 fps. However, due to the limited processing capability of pixel-level processors, the reconstructed image quality is reduced. An image-sensor array integrating a pixel-level parallel processor enabling processing rates up to 10000 fps has also been reported [90].

Using a digital-domain processor, a smaller pixel size can typically be expected, as no capacitor is required in the pixel-level circuit. However, the noise introduced by the pixel-level compression circuitry remains a significant disadvantage of focal-plane processors. In general, dark current is the leakage current that flows through the photosensitive device even when no photons enter the device, and it can result in both FPN and temporal noise. The switching activity of the embedded pixel-level processor can raise the substrate noise near the photosensitive device; as a consequence, temporal noise correlated with the switching activity may increase. This can, however, be minimized if switching is prohibited during the integration cycle.

4. Conclusion

Compression is one of the most demanding image-processing tasks. CMOS technology enables the integration of sensing and image processing, making it a very interesting and promising technology for future imaging systems. A review of the state of the art in on-chip image compression integrated with CMOS image sensors has been provided. The paper began with the traditional architecture, in which the image-capture device and the image-compression processor are implemented as separate units connected by the readout interface. Architectures integrating an in-array pixel-level processor are then used to improve the performance of the overall system; however, the image is still first acquired and then quantized and compressed. A new design paradigm referred to as compressive acquisition was also reviewed, an interesting concept that performs compression during the capture phase prior to the storage phase, so that a reduction of the in-array storage requirement can be expected. Recently proposed image sensors integrating on-chip image-compression processing and featuring high performance, such as high resolution, high throughput, high frame rate, or low power consumption, have also been reviewed.

The development of CMOS image sensors featuring on-chip image compression has been very fast in the last decade. As image resolution keeps increasing, the amount of data per frame grows, requiring more efficient compression capability or higher throughput. In addition, since image sensors are widely used in consumer electronics, most of which are battery powered, low power consumption is very important in today's image-sensor designs. We believe that in the near future, the performance of CMOS image sensors integrating on-chip image compression will meet the requirements of modern imaging systems.

Acknowledgments

The authors would like to thank Mr. Berin Martini and Mr. Matthew Law for their helpful suggestions and discussion. This work was supported by a grant from the Research Grant Council (RGC) of Hong Kong, CERG Grant reference no. 610509.