Standard protocols for reliable data transmission over unreliable channels are based on various Automatic Repeat reQuest (ARQ) schemes, whereby the sending node receives feedback from the receiver and retransmits the missing data. We discuss this issue in the context of one-way data transmission over simple wireless channels characteristic of many sensing and monitoring applications. Using a specific project as an example, we demonstrate how the constraints of a low-cost embedded wireless system get in the way of a workable solution, precluding the use of popular schemes based on windows and periodic acknowledgments. We also propose an efficient solution to the problem and demonstrate its advantage over the traditional protocols.

1. Introduction

Many wireless sensing devices can comfortably operate one way, that is, sending their samples at some intervals with no feedback from the receiving end, assuming that occasional losses are acceptable. There exist, however, a few sensing applications where all samples are considered important. They involve systems where the samples represent a certain process to be analyzed at the receiver, and the fidelity of that analysis is critical. Many areas of medical monitoring, especially those dealing with tracing and diagnosing heart activity, fall into this category. Even though one may argue that occasional gaps in the sampled data can be filled by interpolation or other “guessing,” the community of health care professionals is not receptive to such arguments. On top of the understandable obsession about the utmost quality of data constituting the basis for life-saving diagnosis, medical procedures are prone to (often uninformed) public criticism and litigation. To prevent it, the vocabulary of terms characterizing the precision of medical diagnostic procedures must exclude phrases like “almost all data” and “approximate records.”

From the engineering point of view, one would like to build a practical device with the minimum cost, where by “practical,” we understand one that works and fulfills the expectations of its users. Even if those expectations are high, an overdesigned device is bound to cost more than one that meets those expectations with the minimum expense of resources, be it memory, CPU power, or RF bandwidth. Notably, the amount of RF bandwidth needed by the device affects more than just the monetary cost of the project. The RF spectrum is bound to be more and more polluted everywhere, and especially so in places like health care facilities, where the multitude of wireless sensors will have to compete for bandwidth to deliver all the samples to their collection devices. So, we cannot just say that money is no object for the kind of reliability requirements inherent in medical applications, and for example, nonchalantly overdesign the device for bandwidth. Our goal should be rather to find the right set of algorithms and protocols to accomplish the reliability objectives with the minimum bandwidth possible. In addition to reducing the raw cost of the device, this approach will also result in an “environmentally friendly” design, and if only to our own immediate advantage, will allow us to deploy more devices within a given perimeter.

The problem addressed in this paper arose during the design of a wireless sensing device for heart monitoring based on ballistocardiography (BCG [1, 2]). The primary function of the device, dubbed HDL in the sequel (for Heart Data Logger), was to collect data samples from a set of sensors, store them temporarily in local flash memory, and transmit them reliably (in near real time) over a wireless channel to a workstation, which we will call the CPP (for collection and processing point). Cost considerations combined with restrictions regarding the RF bandwidth, as well as demands for long sustained battery operation, led us to base the wireless link on a cheap low-power RF device driven by a microcontroller.

The most challenging element of the design was the protocol for transmitting the sampled data to the CPP. Even ignoring the losses, the amount of bandwidth needed to transmit the samples in real time approached the physical capability of the RF module. A by-the-book implementation of a two-way window-based ARQ scheme with periodic (sparse) acknowledgments and retransmissions [3] brought the system down to its knees. Regardless of how selective the acknowledgments were, the very fact that the transmitter had to expect feedback and make room for it within the stream of outgoing data rendered its operation extremely inefficient. Consequently, we have devised and studied alternatives to those obvious and popular schemes and arrived at a solution meeting our objectives.

Besides addressing a specific problem, our paper demonstrates that the realm of small embedded wireless systems brings about a bag of idiosyncratic constraints which often force us to look for solutions off the beaten path. Unfortunately, the high-level approach to transport and application-layer protocols, pervading most academic research, tends to ignore the kind of mundane low-level implementation constraints that proved decisive in our study. To a large extent, this is yet another fallout from protocol layering, which has met with consistent criticism in the world of wireless communication [4–9]. One has to resort to cases of product engineering to show those issues in the proper light of their highly practical relevance.

2. The System

2.1. General Outline

Ballistocardiography [1, 2] is a method of collecting and interpreting data about heart action by measuring the acceleration of the body surrounding the heart area. The acceleration is detected and measured by sensors attached to the body and transformed into digital data samples, which are subsequently analyzed and visualized by DSP software. In our design, the accelerometers are connected (by wires) to the HDL device, which is responsible for the analog-to-digital conversion, intermediate storage of the data, and its transmission to the CPP for analysis and visualization.

The device is equipped with a small amount of flash memory which plays a dual role. From the viewpoint of transmitting BCG samples to the CPP, it acts as a buffer compensating for the transmitter's jitter and allowing the node to retransmit missed samples. It also functions as a simple database storing the last few sample streams taken from the subject, which can be transmitted retroactively upon request from the CPP.

2.2. Hardware and Software

The essential hardware components of the HDL are the MSP430F1611 microcontroller [10] and the CC1100 RF module [11] (both by Texas Instruments). RF band restrictions prevented us from using Bluetooth for the radio link (the 916 MHz band was practically the only option), although a variant of the HDL utilizing a Bluetooth module was built and tested. Another reason for rejecting Bluetooth was the considerably larger hardware cost as well as significantly increased power requirements. Moreover, the arcane rules for device pairing and the consequent long and nondeterministic delays in binding the HDL to its CPP (especially with other Bluetooth devices present in the neighborhood) turned out to be prohibitively troublesome.

The microcontroller was programmed under PicOS [8, 12–14], which is a convenient and highly efficient operating system for small-footprint devices, capable of structural multithreading within the confines of the tiny RAM available on low-end microcontrollers. In fact, MSP430F1611 is the largest representative of the MSP430 family, with 10 KB of RAM, which turned out to be more than needed. Our primary concern was to provide a RAM buffer for accommodating the samples before storing them in flash, to account for the occasional hiccups caused by the various special conditions, for example, the need to erase before write on the boundary of a nonempty block. To maximize the life of the flash memory, we avoid unnecessary erase operations and balance the usage of all its segments. Consequently, the procedures (system calls) writing data to flash may block awaiting the moment when the segment to be written gets into the proper state.

The input to the HDL consists of eight analog signals arriving from the accelerometers. Those signals are fed to the eight ADC ports of MSP430F1611 and converted into eight 12-bit values at the rate of 𝑓𝑠=500 conversions per second, yielding 8×12×500=48000 bits per second of incoming data. That rate is a required standard for a diagnostic-grade digital representation of the BCG signal. To avoid confusion, from now on, by a sample we will mean a single set of 8 values collected every 1/500 of a second, while the complete series of samples sent for processing to the CPP will be called a take. In other words, a take represents a complete measurement, which is processed and visualized at the CPP for the purpose of assessment or diagnosis.

The ADC as well as the sample collection process are turned on upon a request from the CPP. Such a request specifies the duration of the take in seconds, which is transformed by the HDL into the corresponding number of samples. Those samples are then collected and stored in flash memory. They can be transmitted in parallel with their collection, or merely collected and stored locally to be transmitted later. The device employs a simple differential compression scheme applied at the stage when the 12-bit ADC samples are repackaged into 48-byte blocks, which are the storage/transmission units. Owing to the nature of the BCG data, the compression scheme brings about highly consistent 45% average savings, which practically never get below 42%. This means that a typical 48-byte block accommodates over 7 samples. As with all lossless compression techniques, it is theoretically possible (although unimaginable in practice) that the scheme will spontaneously inflate the data size by up to 29%.
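The figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below is only an illustration; the function names are ours, and the only inputs are the numbers stated in the text (8 channels, 12 bits, 500 samples per second, 48-byte blocks, and the 45% and 42% savings).

```python
# Back-of-the-envelope check of the data rates quoted in the text.
CHANNELS = 8            # accelerometer signals
BITS_PER_VALUE = 12     # ADC resolution
SAMPLE_RATE_HZ = 500    # samples per second
BLOCK_BYTES = 48        # storage/transmission unit


def raw_bit_rate():
    """Uncompressed bit rate arriving from the ADC."""
    return CHANNELS * BITS_PER_VALUE * SAMPLE_RATE_HZ


def samples_per_block(savings):
    """Average number of samples fitting into one 48-byte block."""
    raw_sample_bytes = CHANNELS * BITS_PER_VALUE / 8  # 12 bytes per sample
    return BLOCK_BYTES / (raw_sample_bytes * (1.0 - savings))


def blocks_per_second(savings):
    """Average number of blocks produced per second of sampling."""
    return SAMPLE_RATE_HZ / samples_per_block(savings)


if __name__ == "__main__":
    print("raw rate:", raw_bit_rate(), "bits/s")
    for s in (0.45, 0.42):
        print(f"savings {s:.0%}: {samples_per_block(s):.2f} samples/block, "
              f"{blocks_per_second(s):.1f} blocks/s")
```

With 45% savings this gives slightly more than 7 samples per block and just under 70 blocks per second, which is the load the radio link has to sustain.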

The range of data transmission rates available to CC1100 is adjustable within a certain interval, with the raw limit around 200 kbps, which, considering the coding (Manchester), framing, and interpacket spacing reduces to about 50 kbps of effective rate (assuming one-way back-to-back packets).

3. Data Communication

3.1. The Problem and Its Classical Solutions

The amount of bandwidth needed to maintain a connection between an HDL and its CPP is trivial, except for the transmission of takes from the HDL to the CPP. Thus, we will focus on this highly asymmetric communication scenario, whereby a significant amount of data is transferred essentially in one direction. The classical problem of reliable data transmission over an unreliable link is formulated in the context of two parties, one of them being the sender (S) and the other the recipient (R), as shown in Figure 1. The setup assumes two separate (possibly logical) channels. With the reverse channel, the recipient is able to convey feedback to the sender, for example, to request retransmission of the missing (damaged) packets.

The simplest solution to the problem has been known as the alternating bit protocol (ABP) [15, 16], and consists in acknowledging every single received packet by the recipient. Generalizations and improvements upon this simple scheme are known under the generic name of ARQ protocols [3]. Their objective is to reduce the amount of traffic in the recipient-to-sender direction, and provide for smooth operation in the face of nontrivial propagation delays between the two parties [17]. They typically involve a window, representing the limit on the number of outstanding (i.e., sent but unacknowledged) packets or bytes that the sender is allowed to send ahead. In the simplest case, a positive acknowledgment referring to a specific packet indicates that all packets up to and including the acknowledged one have been received [18]. Some schemes employ negative acknowledgments [19, 20] to explicitly describe what has been missing, some others rely solely on timeouts. With the latter, having received no acknowledgment for an excessive amount of time, the sender will begin retransmitting packets from the first unacknowledged one [19].

3.2. System Conditioning

The most difficult problem facing an ARQ scheme in our system is the utmost simplicity of the radio link and the lack of a reasonable reservation mechanism that would allow us to implement reliable and deterministic separation of the two logical channels shown in Figure 1. Based on our estimates in Section 2.2, the available bandwidth is already close to the minimum required to sustain the data transfer alone. Consequently, any attempts to impose extra structure on that bandwidth (e.g., acknowledgment slots akin to 802.11 [21–24]), while possibly facilitating orderly delivery of data packets, would significantly reduce the amount of bandwidth available to those packets.

CC1100 comes equipped with rudimentary tools for collision avoidance [11]. The module is able to recognize radio activity in the neighborhood, based on a definable threshold, and automatically hold its own transmission until the activity is gone. The system can take advantage of this function to implement a medium access control scheme facilitating coexistence of multiple nodes within their mutual range. Owing to the fact that our design admits such a situation, we would like, as much as possible, to provide for a social behavior of the multiple HDLs present in the same area. Realistically, we cannot hope to achieve more than one take transfer at a time. However, we should be able to have multiple HDLs reporting their status to the CPP and responding to simple requests effectively in parallel.

To this end, our driver of the RF module implements a simple listen before transmit (LBT) scheme, whereby a node perceiving a radio activity before transmission backs off to avoid a collision. Such a scheme greatly facilitates low-bandwidth communication among multiple nodes, but it does waste bandwidth. Note that all request and status messages are very short. Thus, complex handshakes of the RTS-CTS-DATA-ACK variety are completely useless (and would be harmful [25]) under the circumstances.

Under ideal conditions, that is, no interference from other nodes, the HDL transmitter is able to send packets back-to-back reasonably fast, with minimal interpacket spacing. However, as soon as the collision-avoidance mechanism kicks in (i.e., foreign activity is sensed and backoff is employed), it “loses its step” and may remain blocked for a relatively long time. This is because the time intervals of the collision-avoidance scheme are measured in tens of milliseconds; the RF module is slow to respond to the changes in its status and exhibits a considerable inertia in its built-in LBT mechanism. Besides, the smallest sensible backoff window is about 20 milliseconds.

3.3. Data Formats

Data exchanged between the HDL and its CPP consist of packets framed as shown in Figure 2. This layout, including the maximum payload (data) length, is enforced by the physical characteristics of CC1100. While theoretically, by resorting to some tricks, it would be possible to have payloads of arbitrary length, the 55-byte maximum for the data component is already on the large side, considering the reliability of single-packet reception. (CC1100 uses a 64-byte internal FIFO for storing the outgoing/incoming frame. When the total length of the packet exceeds the FIFO size, the packet must be sent and received “in pieces,” which is not very practical.)

The first logical component of a packet is the link ID, that is, the temporarily unique session identifier assigned by the CPP to one particular HDL. This field is used to tell apart multiple HDLs supervised by the same CPP. The single-byte RQ field identifies the packet type, that is, the kind of request or response carried in “data.”

A packet representing a take fragment always carries a block of 48 bytes encoding a compressed portion of successive samples (see Section 2.2). It also includes the block number, such that its position within the take can be determined independently of other blocks. The data portion of such a packet consists of 52 bytes partitioned between the block number and the 48-byte chunk of compressed data.

3.4. Traditional ARQ Schemes

The first transmission protocol that we tried in our design operated as follows. When the CPP wants to retrieve a take from an HDL, it sends to the HDL a short SEND request that, in addition to the take identifier, specifies two parameters: 𝐹—the starting block number, and 𝑁—the number of blocks to be retrieved. The same request format is also used as a positive or negative acknowledgment. Initially, to start the retrieval, the CPP sends a request with 𝐹=0 and 𝑁 equal to the total number of blocks in the take.

Having received the initial SEND request, the HDL starts to transmit the requested blocks consecutively. As the block number is included in every packet, the receiving node can tell which blocks have made it and which have been lost.

Conceptually, when the CPP receives a new block of data that adds to the continuous sequence of already received blocks belonging to the requested take, it should acknowledge the reception with a new SEND request specifying the next expected block number. A block number beyond the last block of the take means that the entire take has been received. This is how the simplest window-less version of the scheme could work.

To avoid too many acknowledgments, the HDL is allowed to use a window, that is, having sent the first block, it is permitted to send a few blocks ahead without waiting for separate requests for them. Specifically, the HDL maintains two counters: NextToGo and NextToAck. The first counter tells the number of the next block to be transmitted, while the second one points to the first block for which an acknowledgment (meaning a SEND request) has not been received yet. The device is allowed to keep sending blocks for as long as NextToGo − NextToAck < 𝑊, where 𝑊 is the assumed window size. To facilitate this operation, the CPP will refrain from acknowledging every single block. Ideally, it would like to get away with a single acknowledgment per the entire window.

Having reached the end of window, for as long as NextToAck is not advanced, the HDL keeps retransmitting the last block (at some reasonably short intervals). If the CPP has missed some packets from the window, it should repeat the last SEND message until the missing fragment arrives. A duplicate SEND request for the NextToAck block is viewed by the HDL as a request to retransmit all packets starting from NextToAck. Note that as soon as the hole (or holes) in the received data have been plugged by the CPP, it can issue a SEND message whose offset will jump to the end of the received continuous set. This will tell the HDL to abandon the retransmission and move ahead.
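The sender side of this window scheme can be sketched as follows. This is a minimal Python illustration of the NextToGo/NextToAck bookkeeping described above, not the actual HDL firmware; the class and method names are hypothetical, and timing (retransmission intervals, LBT) is deliberately left out.

```python
class GoBackNSender:
    """Sketch of the HDL counters in the window-based (Go-Back-N) scheme."""

    def __init__(self, total_blocks, window):
        self.total = total_blocks
        self.window = window      # W in the text
        self.next_to_go = 0       # next block to be transmitted
        self.next_to_ack = 0      # first block not yet acknowledged

    def can_advance(self):
        """True while NextToGo - NextToAck < W and blocks remain."""
        return (self.next_to_go < self.total
                and self.next_to_go - self.next_to_ack < self.window)

    def send_next(self):
        """Return the block number to transmit now, or None when done."""
        if self.next_to_ack >= self.total:
            return None                      # whole take acknowledged
        if self.can_advance():
            block = self.next_to_go
            self.next_to_go += 1
            return block
        # End of window: keep repeating the last block sent.
        return max(self.next_to_go - 1, 0)

    def on_send_request(self, first):
        """Handle a SEND request naming the next block expected by the CPP."""
        if first > self.next_to_ack:
            # Acknowledgment: slide the window, abandon any retransmission.
            self.next_to_ack = first
            self.next_to_go = max(self.next_to_go, first)
        elif first == self.next_to_ack:
            # Duplicate request: go back and resend from NextToAck.
            self.next_to_go = self.next_to_ack


if __name__ == "__main__":
    s = GoBackNSender(total_blocks=10, window=3)
    print([s.send_next() for _ in range(5)])   # window fills, last block repeats
    s.on_send_request(0)                       # duplicate request: go back
    print([s.send_next() for _ in range(3)])
```

Note that the sketch says nothing about when `on_send_request` can actually be invoked; on our platform that is precisely the expensive part.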

In general terms, the described scheme falls into the standard family of Go-Back-N ARQ protocols and, in particular, lies at the foundation of TCP [26]. The key to its effectiveness in the application at hand is in the right selection of the window size 𝑊, the intervals between SEND messages, and the timeouts. Notably, the HDL has to make sure that the acknowledgments can be received at all; thus, it cannot be overly aggressive with transmissions.

Note that the problem is quite idiosyncratic of our RF framework. In the classical analysis of the ARQ schemes, it is commonly assumed that the cost of receiving an acknowledgment has nothing to do with the cost of sending a data packet. The primary role of the window in such a system is to compensate for the end-to-end propagation delay by turning the channel into a pipe [27]. In our case, the propagation delay is insignificant, and the window is used as the grain of acknowledgment—to avoid too frequent “channel reversals” and interruptions in the “proper” stream of data arriving from the HDL.

To the best of our knowledge, the problem of implementing an ARQ scheme over a wireless channel has never before been looked at from this particular angle. Most of the previous efforts in this area have focused on two issues: (1) adapting TCP for wireless connections [28–30], and (2) lowering the effective packet error rate in cellular channels [9, 31, 32]. In the second case, the role of ARQ is not to absolutely guarantee the delivery of all data, but to reduce the number of losses, possibly in combination with various forward error correction (FEC) techniques naturally employed in cellular systems [9]. All those variants of ARQ assume that the requisite acknowledgments are delivered over a separate channel whose interference with the primary (data) channel is either completely immaterial, or its nature is much less destructive than with our “channel reversal.”

The amount of time needed to expedite a single data packet containing a block of the requested take is 𝑡𝑝≈5 milliseconds. Adding to this the various unavoidable delays in the OS and in the driver, we conclude that with back-to-back transmissions we are able to send about 𝑓𝑝≈150 blocks per second. Considering that the frequency of sampling 𝑓𝑐 is 500 samples per second, which translates into about 70 blocks per second on the average (see Section 2.2), we find ourselves comfortably within the realm of feasibility. One should realize, however, that this optimistic conclusion only holds under the assumption that the stream of transmitted data is not disrupted for excessively long intervals. As it turns out, any attempts to provide reception opportunities (LBT) between a pair of sent data packets incur a time overhead 𝑡𝑎 of order 15–30 milliseconds, which is 3–6 times more than the transmission time of one block.
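To see how quickly the margin disappears, consider the simplified case in which a reception opportunity of length 𝑡𝑎 follows every data packet. The Python sketch below is an idealized upper bound under that assumption: it ignores the OS and driver delays as well as losses, and the function name is ours.

```python
T_P = 0.005          # seconds, transmission time of one block (t_p)
REQUIRED_RATE = 70   # blocks per second produced by the sampling process


def block_rate(overhead_per_packet):
    """Blocks per second if every packet is followed by the given overhead."""
    return 1.0 / (T_P + overhead_per_packet)


if __name__ == "__main__":
    for t_a in (0.0, 0.015, 0.030):
        rate = block_rate(t_a)
        print(f"overhead {t_a * 1000:4.0f} ms: {rate:6.1f} blocks/s, "
              f"keeps pace: {rate >= REQUIRED_RATE}")
```

Even the optimistic 15-millisecond overhead brings the achievable rate below the roughly 70 blocks per second needed to keep pace with sampling.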

Consequently, the performance of the traditional ARQ scheme, in the version described above, was pathetic regardless of the setting of its parameters. With the LBT delays and the extra losses incurred by the interference from acknowledgments (backoffs), the effective rate went down to about 15–20 blocks per second. This way, the amount of time needed to transmit a 20-second take was close to two minutes.

Most of the relevant insight into the problem can be obtained from a very simple model based on the following assumptions:
(1) the cost (time) of a packet (block) transmission within a window is fixed and equal to 𝑡𝑝;
(2) the time penalty of receiving a feedback from the recipient is also fixed and equal to 𝑡𝑎;
(3) the probability of a packet error, expressed as 𝑃𝑒, is fixed and independent.

Let 𝑊 denote the window size. With sparse acknowledgments separated as widely as the window size, the approximate amount of time needed to transmit a take of 𝑁 blocks can be expressed as

$$T(N) = W_c \times t_p + t_a + \sum_{i=0}^{W_c} P(i, W_c) \times T(N - i), \tag{1}$$

where 𝑊𝑐=min(𝑊,𝑁) and

$$P(n, m) = \begin{cases} P_e \times (1 - P_e)^n & \text{if } n < m, \\ (1 - P_e)^m & \text{otherwise} \end{cases} \tag{2}$$

is the probability that exactly 𝑛 initial blocks of 𝑚 total have been transmitted successfully. This yields the following recursive formula:

$$T(N) = \frac{W_c \times t_p + t_a + \sum_{i=1}^{W_c} P(i, W_c) \times T(N - i)}{1 - P(0, W_c)}, \tag{3}$$

with 𝑇(0)=0.
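The recursion is straightforward to evaluate numerically. The following Python sketch computes 𝑇(𝑁) bottom-up from 𝑇(0)=0; it is an illustration of the model only, the function names are ours, and the parameter values in the example are taken from the text (𝑡𝑝 = 5 ms, 𝑡𝑎 = 30 ms, 𝑃𝑒 = 0.1) rather than from any measurement.

```python
def p_prefix(n, m, pe):
    """Eq. (2): probability that exactly the first n of m blocks get through."""
    if n < m:
        return pe * (1.0 - pe) ** n
    return (1.0 - pe) ** m


def go_back_n_time(n_blocks, window, t_p, t_a, pe):
    """Eq. (3): expected time to deliver n_blocks, evaluated bottom-up.

    Requires pe < 1; otherwise the denominator 1 - P(0, Wc) vanishes.
    """
    T = [0.0] * (n_blocks + 1)            # T[0] = 0
    for n in range(1, n_blocks + 1):
        wc = min(window, n)
        acc = wc * t_p + t_a
        for i in range(1, wc + 1):
            acc += p_prefix(i, wc, pe) * T[n - i]
        T[n] = acc / (1.0 - p_prefix(0, wc, pe))
    return T[n_blocks]


if __name__ == "__main__":
    for w in (5, 10, 20, 40):
        t = go_back_n_time(1000, w, t_p=0.005, t_a=0.030, pe=0.1)
        print(f"W = {w:2d}: T(1000) = {t:.2f} s")
```

With 𝑃𝑒=0 the recursion collapses to one acknowledgment per full window, which is a convenient sanity check of the implementation.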

If the acknowledgments are in fact sent independently (based on loose timeouts at the recipient), 𝑡𝑝 has to be large enough to provide a reception opportunity after every packet. On the other hand, one can try to make the windows explicit and formally request that the feedback be only sent at the window boundary. This may require special signals (short packets) at the end of a window—to notify the recipient that the feedback is expected and should be provided. While those “signals” may take more time than a simple reception opportunity for an acknowledgment, the packets sent within a window can be spaced tightly (no LBT) avoiding much of the overhead. The advantage is illustrated in Figure 3, which compares the normalized time of transmitting a long series of blocks under different scenarios, with the two variations discussed above represented by curves I. and II. The normalized time is expressed as the ratio of the actual transmission time 𝑇 to the time 𝑇𝑠 required to collect 𝑁 samples, where 𝑁 is the number of transmitted packets (thus, any value above 1 should be viewed as a failure to keep pace with the sample collection process). The window size was 10 blocks. In the first (spontaneous) case, 𝑡𝑝 is set to 20 milliseconds, which roughly corresponds to the minimum reasonable packet separation interval that would provide any reception opportunities at all. The cost of receiving an acknowledgment was set to 30 milliseconds in both cases, which was rather optimistic. No detailed experimental tests were carried out for these schemes, as they were immediately seen to be practically useless.

Indeed, once we conclude that acknowledgments should only be sent at explicit window boundaries, it makes little sense to follow the Go-Back-N approach. Instead, it would be much wiser to selectively indicate in the acknowledgment the exact blocks that were missing in the window. Then, the next window would start with the missing blocks from the previous one. The performance of this new scheme is captured by (1) after substituting

$$P(n, m) = \binom{m}{n} \times (1 - P_e)^n \times P_e^{(m - n)}, \tag{4}$$

which represents the probability that exactly 𝑛 of the 𝑚 blocks have been received correctly.
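The selective variant of the model differs only in the probability term. A minimal Python sketch, under the same assumptions as the model itself (fixed 𝑡𝑝 and 𝑡𝑎, independent errors) and with function names of our choosing, is given below; the parameter values in the example are illustrative.

```python
from math import comb


def p_exact(n, m, pe):
    """Eq. (4): probability that exactly n of m blocks are received correctly."""
    return comb(m, n) * (1.0 - pe) ** n * pe ** (m - n)


def selective_time(n_blocks, window, t_p, t_a, pe):
    """Expected time to deliver n_blocks with selective feedback per window.

    Same recursion as for Go-Back-N, with the binomial term of Eq. (4).
    Requires pe < 1.
    """
    T = [0.0] * (n_blocks + 1)            # T[0] = 0
    for n in range(1, n_blocks + 1):
        wc = min(window, n)
        acc = wc * t_p + t_a
        for i in range(1, wc + 1):
            acc += p_exact(i, wc, pe) * T[n - i]
        T[n] = acc / (1.0 - p_exact(0, wc, pe))
    return T[n_blocks]


if __name__ == "__main__":
    for w in (5, 10, 20, 40):
        t = selective_time(1000, w, t_p=0.005, t_a=0.030, pe=0.1)
        print(f"W = {w:2d}: T(1000) = {t:.2f} s")
```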

The solution for 𝑊𝑐=10 is depicted by curve III. in Figure 3. Its superiority over the Go-Back-N approach is clear. One of its desirable features is the strict monotonicity with respect to the window size, which is not the case with the Go-Back-N scheme. This is because, with Go-Back-N, an error occurring within a large window will trigger the retransmission of the entire tail. While a jump ahead can be forced by the CPP when it detects that the hole has been plugged (this possibility is not captured by our simple model), large windows will tend to contain multiple erroneous packets, which will nullify the impact of this feature.

The difference in character between a Go-Back-N protocol and a selective scheme applied to our system is shown in Figure 4. Parameters 𝑡𝑝 and 𝑡𝑎 correspond to the values characteristic of our platform, and the probability of error 𝑃𝑒=0.1 is relatively large, but not uncommon in our application.

3.5. The Workable Scheme

The monotonicity of the simple selective scheme with respect to the window size 𝑊 (Figure 4) suggests that the window size should be as large as possible. Note, however, that one factor tacitly ignored in the simplified performance model is the size of the feedback (acknowledgment) packet, which depends on the number of missing blocks. One would like to avoid a situation when the feedback message itself consists of multiple packets because then we would have to cope with two more issues.
(1) Multiple channel reversals, which, as we explained in Section 3.2, tend to waste a disproportionate amount of bandwidth.
(2) Reliable reception of the feedback. In a situation when the feedback message consists of multiple packets, simple persistent and idempotent schemes will not work, which will bring about more bandwidth wastage.

As a side note, let us mention here a third issue, usually ignored in protocol design studies, namely, the complexity of the implementation expressed in the mundane terms of code length. This parameter is not irrelevant in the realm of microcontrollers; even if the computational complexity of a program is acceptable, the program should not be too big, as it may not fit into the limited ROM.

As Figure 4 shows diminishing returns for pushing the window size beyond a “reasonably” large value, we propose to make the window size variable and adopt the following set of rules:
(1) the window size is determined by the (maximum) number of requested blocks whose description can fit into a single request/feedback packet;
(2) the end of a window is explicitly indicated by the sender (the HDL) in a persistent manner until noticed by the recipient (the CPP);
(3) request packets are idempotent and they are sent persistently by the CPP, until it notices the arrival of a new window.

This approach minimizes the number of channel reversals and results in a scheme whereby the CPP repeatedly polls the HDL for the missing blocks and then expects to receive them in the next window of packets, doing so until all the blocks have made it. At every step, the CPP asks for the maximum number of blocks that can be described in a single request packet. The generic algorithm for take extraction executed by the HDL can be described as follows.
(1) Wait for a request. This is the main loop executed by the HDL while being idle. Having received a request, proceed at 2.
(2) Extract from the request the list of blocks to be sent and send them blindly back-to-back (no LBT) until the last requested block has been expedited. Then proceed at 3.
(3) Send periodically, at reasonably sparse intervals and with LBT on (so as to enable reception opportunities), a short packet indicating the end of window. A natural choice for such a packet is an empty block. Keep doing so until a new request arrives from the CPP. Then proceed at 4.
(4) If the request is NULL, that is, no more blocks are needed, terminate the operation and proceed at 1. Otherwise, proceed at 2.
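The four steps above map naturally onto a small state machine. The Python sketch below is an illustration only and not the PicOS implementation; the class, the packet tuples, and the convention of returning an LBT flag with each packet are assumptions made for the example. In the real device, requests can only be received during step 3, when LBT is on.

```python
class HdlSender:
    """Event-driven sketch of the HDL side of the protocol (steps 1 to 4)."""

    IDLE, SENDING, END_OF_WINDOW = range(3)

    def __init__(self, take):
        self.take = take            # list of 48-byte block payloads
        self.state = self.IDLE      # step 1: wait for a request
        self.pending = []           # blocks still to send in this window

    def on_request(self, blocks):
        """A request arrived; None stands for the NULL request."""
        if blocks is None:          # step 4: nothing more is needed
            self.pending = []
            self.state = self.IDLE
        else:                       # step 2: (re)start the requested window
            self.pending = list(blocks)
            self.state = self.SENDING

    def next_packet(self):
        """Return (packet, use_lbt) for the next transmission, or None if idle."""
        if self.state == self.SENDING:
            if self.pending:
                n = self.pending.pop(0)
                return ("BLOCK", n, self.take[n]), False   # blind, back-to-back
            self.state = self.END_OF_WINDOW
        if self.state == self.END_OF_WINDOW:
            return ("EOW",), True   # step 3: sparse, with LBT on
        return None


if __name__ == "__main__":
    hdl = HdlSender([b"a", b"b", b"c"])
    hdl.on_request([0, 2])
    for _ in range(4):
        print(hdl.next_packet())
    hdl.on_request(None)
    print(hdl.next_packet())
```

Because a repeated request simply restarts the same window, the sender needs no memory of what was acknowledged earlier.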

A flowchart view of the above algorithm is shown in Figure 5, with the ovals representing waiting states, and the triangle (labeled Rq) indicating the event consisting in the reception of a request packet. For transmission while sampling, the algorithm starts with one implicit round whereby the blocks are sent back-to-back while being collected from the ADC. This constitutes the first window consisting of all blocks contributing to the take. Following that round, the algorithm continues at step 3.

The CPP's end of the protocol looks like the following.
(1) Initialize by marking all blocks to be received as absent. Then continue at 2.
(2) Find out which blocks are still absent. If none, send a NULL request and enter the IDLE state. Otherwise, prepare a single request packet that covers as many of the missing blocks as possible and proceed at 3.
(3) Keep sending the request packet at reasonably sparse intervals until a block packet arrives from the HDL. Then proceed at 4.
(4) Keep receiving the blocks and storing them until you see the end-of-window packet (an empty block). Then proceed at 2.
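The recipient's side can be sketched in the same event-driven style. Again, this is a hedged Python illustration with hypothetical names: requests are modeled as tuples of block numbers, the empty tuple stands for the NULL request, and the limit of 16 blocks per request is just a placeholder for whatever fits into one packet.

```python
class CppReceiver:
    """Event-driven sketch of the CPP side of the protocol (steps 1 to 4)."""

    REQUESTING, RECEIVING, IDLE = range(3)
    NULL_REQUEST = ()

    def __init__(self, total_blocks, max_per_request=16):
        self.blocks = [None] * total_blocks     # step 1: everything absent
        self.max_per_request = max_per_request  # assumed packet capacity
        self._next_round()

    def _next_round(self):
        """Step 2: build the next request from the blocks still missing."""
        missing = [i for i, b in enumerate(self.blocks) if b is None]
        if missing:
            self.request = tuple(missing[:self.max_per_request])
            self.state = self.REQUESTING
        else:
            self.request = self.NULL_REQUEST
            self.state = self.IDLE

    def on_timer(self):
        """Step 3: request to (re)send now, or None if nothing should be sent."""
        return self.request if self.state == self.REQUESTING else None

    def on_block(self, number, payload):
        """Step 4: store a block; the first one ends the request phase."""
        if self.state == self.IDLE:
            return                              # spurious, take is complete
        self.blocks[number] = payload
        self.state = self.RECEIVING

    def on_end_of_window(self):
        """End-of-window packet seen; returns a request to send, or None."""
        if self.state == self.RECEIVING:
            self._next_round()
            return self.request
        if self.state == self.IDLE:
            return self.NULL_REQUEST            # push the HDL to idle as well
        return None                             # still requesting: timer handles it
```

A short usage example: after `c = CppReceiver(3, max_per_request=2)`, the call `c.on_timer()` yields the request `(0, 1)`; once block 0 arrives and the end-of-window packet is seen, `c.on_end_of_window()` yields `(1, 2)`.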

Figure 6 shows the flowchart view of the above algorithm. The Bk triangle represents the reception of a block packet from the other party. For an immediate transmission-while-sampling request, the first request issued by the CPP will instruct the HDL to initiate the sampling procedure, and will implicitly trigger the first “total” window consisting of all the blocks contributing to the take.

The simple (idempotent) nature of requests and block transmissions automatically takes care of all possible loss scenarios and races. Note, for example, that all blocks of a given window requested by the CPP can be lost. Thus, when the HDL arrives at step 3 of its algorithm, it will eventually receive again the original request, which persists at the CPP until it sees the first (any) packet of the requested window. Consequently, it will just retransmit the last requested window in its entirety. Similarly, the CPP does not assume that its single NULL request (sent in step 2 to indicate the completion of the transfer) always makes it to the HDL. Rather, having assumed the IDLE state, it will reply with a NULL request to any end-of-round packet received from the HDL, to eventually force it to the IDLE state as well. This is illustrated in Figure 6 with the (spurious) Bk event in the upper left area of the flowchart.
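The robustness argument can be illustrated with a toy simulation in which every packet, in either direction, is lost independently with the same probability. The Python sketch below makes several simplifying assumptions that are ours: time is ignored, a lost request or end-of-window packet is simply repeated until one copy gets through, and the final NULL handshake is left out.

```python
import random


def simulate_take(total_blocks, loss, per_request=16, seed=0):
    """Return (rounds, packets) needed to deliver the whole take."""
    rng = random.Random(seed)
    received = set()
    rounds = packets = 0
    while len(received) < total_blocks:
        missing = [b for b in range(total_blocks) if b not in received]
        request = missing[:per_request]
        # CPP repeats the idempotent request until one copy reaches the HDL.
        packets += 1
        while rng.random() < loss:
            packets += 1
        # HDL sends the requested window blindly; some blocks get lost.
        for b in request:
            packets += 1
            if rng.random() >= loss:
                received.add(b)
        # HDL repeats the end-of-window packet until the CPP notices it.
        packets += 1
        while rng.random() < loss:
            packets += 1
        rounds += 1
    return rounds, packets


if __name__ == "__main__":
    for loss in (0.0, 0.1, 0.3):
        print(loss, simulate_take(1400, loss))
```

If an entire window is lost, the next round simply repeats the same request, which mirrors the behavior described above; the loop terminates for any loss probability below 1.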

3.6. Request Formats

The key to a good performance of the protocol described in the previous section is the efficient description of missing blocks in the request packet. We have implemented two formats of such requests.

With format 1, a request packet is filled with block numbers, each number occupying three consecutive bytes, up to the maximum of 16 values (48 bytes). A simple trick is played to describe individual blocks as well as continuous ranges. Suppose that $c_0, \ldots, c_{n-1}$ is the sequence of block numbers specified in a request packet. These numbers are interpreted by the HDL as follows.
(1) Set 𝑖=0.
(2) If 𝑖=𝑛, done (all requested blocks have been sent). Otherwise, set 𝑐𝑎=𝑐𝑖 and 𝑖=𝑖+1.
(3) If 𝑖=𝑛 or 𝑐𝑖>𝑐𝑎, send block 𝑐𝑎. Otherwise, set 𝑐𝑏=𝑐𝑖 and 𝑖=𝑖+1, and send the blocks 𝑐𝑏, …, 𝑐𝑎. Continue at 2.

Thus, for as long as the block numbers are increasing, they describe individual blocks, while a decreasing number together with its preceding number are taken together as the boundary of a continuous range (chunk) of blocks. Note that the pair $N-1, 0$ (where $N$ is the total number of blocks in the take) requests the (initial) total window comprising the entire take.
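The interpretation rules for format 1 can be sketched in a few lines of Python. This is only an illustration of steps (1) to (3) as stated in the text, not the HDL firmware; the function name `decode_format1` is hypothetical, and the three-byte packing of each number is left out.

```python
def decode_format1(numbers):
    """Expand a format 1 request (a list of block numbers) into the blocks to send."""
    blocks = []
    i, n = 0, len(numbers)
    while i < n:                        # step 2: done when all numbers are consumed
        ca = numbers[i]
        i += 1
        if i == n or numbers[i] > ca:   # step 3: next number is larger (or absent),
            blocks.append(ca)           # so ca names a single block
        else:
            cb = numbers[i]             # a non-increasing number closes a range
            i += 1
            blocks.extend(range(cb, ca + 1))
    return blocks


# Blocks 3 and 7 individually, the range 9..12, then block 20.
print(decode_format1([3, 7, 12, 9, 20]))
```

With these rules, the pair `[N - 1, 0]` expands to every block of an `N`-block take, which is how the initial total window is requested.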

With format 2, it is possible to use bit maps to efficiently describe sizable hollow fragments. The collection of requested blocks is described by a sequence of elements, which may identify continuous ranges of blocks, as well as random selections represented by bit maps. Each element starts with a header, which consists of one (leading) byte followed by a three-byte block number, as shown in Figure 7. The most significant bit of the leading byte, labeled 𝑇, distinguishes between two element types: origin (𝑇=0) and chunk (𝑇=1). An origin type element requests explicitly the block number ORG to be sent to the CPP, and also sets the current location within the take to the block number ORG+1. If ms is nonzero, then ms consecutive bytes following ORG are interpreted as a bit map requesting selected blocks from the take fragment starting at block number ORG+1. This is the block number corresponding to the first bit in the map.

The second element type (chunk) describes a continuous range of blocks starting at the current position, as determined by the previous sequence of elements within the packet. The three bytes following the header byte encode the number of blocks falling into the chunk. If the chunk element is the first element of the request, that is, the current position has not been explicitly defined, it is assumed to be zero. Thus, the complete take (all blocks) is described by a single element whose first byte is 0x80 and the following three bytes contain the total number of blocks in the take.

Similar to an origin element, a chunk element may specify a map (its ms field may be nonzero). The bits of such a map apply to blocks immediately following the chunk. Figure 8 shows a sample request fragment which calls for block number 215, then uses a bit map to select among blocks 216–287. Note that one byte of the map covers 8 consecutive blocks; thus, the 9 map bytes describe 72 blocks starting at block 216. The subsequent chunk element requests the continuous range of blocks 288–322 (35 blocks total). Finally, the second map, 12 bytes long, applies to blocks 323–418. The total length of the request fragment shown in Figure 8 is 29 bytes (two 4-byte headers plus 9 and 12 map bytes).
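The element rules for format 2 can be illustrated with a small decoder. The sketch below rests on several assumptions that the text does not state: the ms field is taken to occupy the low seven bits of the leading byte, block numbers are taken to be big-endian, and map bits are read most significant bit first, with a set bit meaning "requested". The name `decode_format2` and the map contents in the example are hypothetical.

```python
def decode_format2(packet: bytes):
    """Expand a format 2 request into the list of requested block numbers."""
    blocks = []
    pos = 0                 # current position; zero until set explicitly
    i = 0
    while i < len(packet):
        lead = packet[i]
        value = int.from_bytes(packet[i + 1:i + 4], "big")  # assumed byte order
        ms = lead & 0x7F    # assumed location of the ms field
        i += 4
        if lead & 0x80:     # T = 1: chunk of `value` blocks from the current position
            blocks.extend(range(pos, pos + value))
            pos += value
        else:               # T = 0: origin, request block ORG and move to ORG + 1
            blocks.append(value)
            pos = value + 1
        for byte in packet[i:i + ms]:   # optional bit map, 8 blocks per byte
            for bit in range(8):
                if byte & (0x80 >> bit):
                    blocks.append(pos + bit)
            pos += 8
        i += ms
    return blocks


# Same layout as the fragment described for Figure 8, with made-up map contents:
# origin 215 plus a 9-byte map, then a 35-block chunk plus a 12-byte map.
map1 = bytes([0x80]) + bytes(8)     # selects only block 216
map2 = bytes(11) + bytes([0x01])    # selects only block 418
fragment = (bytes([9]) + (215).to_bytes(3, "big") + map1
            + bytes([0x80 | 12]) + (35).to_bytes(3, "big") + map2)
print(len(fragment))                # 29 bytes
```

Under the same assumptions, the four bytes `80 00 0F A0` request all blocks of a 4000-block take.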

One can think of several ways to generate requests (in either format). The problem is nontrivial, if we want to do it in an absolutely optimal fashion. For example, it may be OK to request some superfluous (already received) blocks, if the request can be shortened this way. This makes sense when a request that would not fit into a single packet can be thus made to fit. Note that the cost of including a superfluous block in the window (𝑡𝑝) is relatively low compared to the cost of handling an extra round.

The simple heuristics for format 2 used in our implementation of the protocol in the CPP work this way.
(i) The first missing block is used as the ORG of the new request packet.
(ii) Starting from ORG+1, consecutively numbered blocks are examined in bunches of 8 (corresponding to the bytes of the bit map). If at most one block per 8 is present (the map byte contains at most one zero), the bunch becomes a candidate for a chunk. If five or more consecutive bunches (at least 40 blocks) are collected this way, a chunk element is generated and it covers all the subsequent blocks whose 8-block bunches contain no more than one superfluous block. Note that such an element takes 4 bytes, that is, less than a map covering five or more bunches.
(iii) If a bunch is all zeros (all the blocks from the bunch are already present), it becomes a candidate for a skip, that is, advancement to a new ORG. A new ORG element is generated whenever we hit five or more consecutive bunches with this property.
(iv) Otherwise, a map is built until either a chunk or a skip is encountered (according to the above criteria), or the request packet is completely filled up. Note that such a map can follow a chunk.
(v) This procedure continues until the request packet is filled completely or there are no more blocks to request.
When the round is over, following the reception of the requested window, a new request packet is generated according to the same set of rules.
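The classification at the heart of rules (ii) and (iii) can be sketched as follows. This covers only the labeling of bunches and the detection of five consecutive candidates; it does not emit request elements and is not the CPP code. The names `classify_bunches` and `first_run`, and the representation of missing blocks as a Python set, are assumptions made for the illustration.

```python
def classify_bunches(missing, org, n_blocks):
    """Label each bunch of 8 blocks after ORG as a 'skip', 'chunk', or 'map' candidate."""
    labels = []
    for start in range(org + 1, n_blocks, 8):
        bunch = range(start, min(start + 8, n_blocks))
        present = sum(1 for b in bunch if b not in missing)
        if present == len(bunch):
            labels.append("skip")    # rule (iii): nothing to request in this bunch
        elif present <= 1:
            labels.append("chunk")   # rule (ii): at most one superfluous block
        else:
            labels.append("map")     # rule (iv): described bit by bit
    return labels


def first_run(labels, kind, min_len=5):
    """Index of the first run of at least min_len consecutive labels of one kind."""
    run = 0
    for idx, label in enumerate(labels):
        run = run + 1 if label == kind else 0
        if run == min_len:
            return idx - min_len + 1
    return None


missing = {10} | set(range(51, 91))       # one stray block, then a 40-block gap
labels = classify_bunches(missing, org=10, n_blocks=107)
print(labels)
```

In this example the five bunches covering blocks 11 to 50 qualify for a skip, and the five bunches covering blocks 51 to 90 qualify for a chunk.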

3.7. Analytical Assessment

The scheme discussed in the last two sections can be viewed as a modification of the selective retransmission protocol described at the end of Section 3.4. The modification consists in making the window size variable. Except for the first round, in which the window covers the entire set of blocks, the size of every subsequent window is determined by the capacity of the request packet sent by the CPP, understood as the conveyed number of missing blocks. To estimate that number, suppose, as we did in Section 3.4, that packet errors are independent events occurring with probability 𝑃𝑒. Let us begin with format 1. Assume that we are at the end of the first round and our objective is to fill a request packet of size 𝑢 slots, where one slot accommodates one block number. The question we ask is as follows: what is the expected number of missing blocks that will be described by such a packet?

In an asymptotically interesting case, we are looking at a large (infinite) number of blocks to transmit, and there are always sufficiently many missing blocks to fill an entire request packet. Moreover, their configuration is nontrivial, that is, not all of them are missing (in which case just two slots would do). Whenever we hit a missing block, there are two possibilities.
(i) The block is a “singleton,” that is, the next block is not missing. This event will occur with probability $1 - P_e$. In such a case, we will use exactly one slot of the packet to describe exactly one block.
(ii) With the remaining probability of $P_e$, the next block is missing as well, and we have a continuous range of two or more missing blocks. In that case, we will use two slots and the expected number of blocks covered by them is equal to

$$\sum_{k=2}^{\infty} P_{k-2} \times k, \tag{5}$$

where $P_{k-2}$ is the probability of exactly $k-2$ consecutive errors (note that we already know that two consecutive blocks are missing).

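Thus, the expected capacity of a format 1 request with $u$ slots is given by the following recurrence relation:

$$C_u = (1 - P_e)\,(C_{u-1} + 1) + P_e \left( C_{u-2} + (1 - P_e) \sum_{i=2}^{\infty} P_e^{\,i-2} \times i \right), \tag{6}$$

which transforms into

$$C_u = (1 - P_e)\,C_{u-1} + P_e\,C_{u-2} + \frac{1}{1 - P_e}, \tag{7}$$

with the boundary conditions $C_1 = 1$ and $C_0 = 0$.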
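Recurrence (7) is easy to evaluate numerically. The sketch below iterates it from the stated boundary conditions; the function name `capacity_format1` is hypothetical, and the code assumes $P_e < 1$.

```python
def capacity_format1(u, pe):
    """Expected number of missing blocks described by u slots, per recurrence (7)."""
    prev2, prev1 = 0.0, 1.0            # boundary conditions C_0 = 0 and C_1 = 1
    if u == 0:
        return prev2
    for _ in range(2, u + 1):
        prev2, prev1 = prev1, (1 - pe) * prev1 + pe * prev2 + 1 / (1 - pe)
    return prev1


for pe in (0.0, 0.1, 0.2, 0.4):
    print(pe, round(capacity_format1(16, pe), 2))
```

With no errors the capacity equals the number of slots, and it grows only slowly with the error rate, because runs of missing blocks long enough to be worth a two-slot range are rare when errors are independent.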

The capacity of a format 2 request packet is more difficult to estimate (at least in all circumstances) because of the multitude of cases. Note, however, that in those scenarios when that format is expected to bring most help, the density of missing blocks will result in most of them being represented as bit maps. Then, the capacity of a format 2 packet can be simply approximated as

$$C'_u = B \times P_e, \tag{8}$$

where $B$ is the maximum number of bits in the packet available for the bit map. This approximation is going to work well at least for a medium range of $P_e$.

Figure 9 compares the two functions for $M=16$ and $B=352$, which values match the actual parameters of request packets in our system. Format 2 appears to clearly win, except for very low (less than 4%) and extremely high (larger than 97%) error rates. Note that approximation (8) ceases to work in those regions. For the low end, that happens when fewer than one of every 40 blocks is missing (around $P_e=0.025$), in which case skips will prevail over maps (see Section 3.6), but we can confidently say that below this “phase transition” threshold format 2 is going to be slightly worse than format 1, because its representation of individual blocks is less efficient (4 bytes per block instead of 3). Needless to say, the other extreme (error rates approaching 1) is irrelevant. For $P_e$ between 0.05 and 0.9, the simple formula (8) gives in fact a very good approximation, at least as long as packet errors are independent.
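The comparison can be reproduced roughly with a few lines of code. The sketch below evaluates the format 1 recurrence for 16 slots and the bit-map approximation for 352 bits, and reports which one describes more missing blocks. The function names are hypothetical, and, as noted above, the format 2 figure is only an approximation that should not be trusted near the ends of the error-rate range.

```python
def capacity_format1(u, pe):
    """Format 1 capacity from the recurrence with C_0 = 0 and C_1 = 1."""
    prev2, prev1 = 0.0, 1.0
    if u == 0:
        return prev2
    for _ in range(2, u + 1):
        prev2, prev1 = prev1, (1 - pe) * prev1 + pe * prev2 + 1 / (1 - pe)
    return prev1


def capacity_format2(b, pe):
    """Bit-map approximation: each of b bits covers one block, missing with prob. pe."""
    return b * pe


def better_format(pe, u=16, b=352):
    """Return 1 or 2, whichever format has the larger estimated capacity."""
    return 1 if capacity_format1(u, pe) >= capacity_format2(b, pe) else 2


for pe in (0.02, 0.05, 0.2, 0.5):
    print(pe, round(capacity_format1(16, pe), 1),
          round(capacity_format2(352, pe), 1), better_format(pe))
```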

Using the values produced by formulas (7) and (8), one can toy with (1) and (4). One simple way to tweak that model to describe our scheme is to set in (1)

$$W_c = \begin{cases} N & \text{if } N = N_0, \\ C_u \ (\text{or } C'_u) & \text{otherwise,} \end{cases} \tag{9}$$

where $N_0$ is the total number of blocks in the transmitted take. As formula (4) requires $W_c$ to be an integer number, we can restrict the application of (1) to those values of $P_e$ that generate integer values of $C_u$ (or $C'_u$) and interpolate for other values.

Figure 10 shows the result of applying the new model to the case of transmitting 10 000 blocks. The two variants of our scheme are compared to two instances of the straightforward selective acknowledgment protocol with a fixed window size. Our protocols are represented by discrete points, to emphasize the fact that only certain values of $P_e$ can be handled by the model. In particular, the expected window size grows rather slowly for format 1, and only three integer values (16, 17, 18) show up for $P_e < 0.4$. The value for $P_e \approx 0.048$ (window size 17) for format 2 has a question mark, as the point is close to the phase transition threshold and thus not reliable. Note that a standard selective acknowledgment scheme with fixed window size is bound to lose slightly, even when the error rate is zero, because of the need to interrupt transmission at the window boundary and momentarily reverse the channel.

4. Empirical Verification

4.1. Packet Losses

In a real-life deployment of an HDL-CPP system, one can distinguish two scenarios in which a packet can be lost. First, even in the absence of explicit external interference (e.g., from another RF device), a packet can be lost “statistically” because of the background noise. Under normal operating conditions, which assume the maximum transmission range of 30 meters and the lack of interference from another simultaneous transmission (involving a different HDL), the rate of such errors is below 1% and they appear to be random and truly independent events.

The second scenario type involves interference from other HDL/CPP setups operating nearby on the same channel. Losses in such a case can be longer and correlated, namely, runs of missed packets are more likely. While such situations are not representative of normal operation (and they are avoidable by the application, see Section 5.2), we also studied them to assess the resilience of our system to explicit RF interference causing losses in excess of 50% of packets.

4.2. Observed Performance

Table 1 shows the distribution of error runs under the “random” losses. By intentionally crippling the system, that is, attenuating the received signal level beyond normal operating conditions, we pushed the packet error rate over 30%. The attenuation was accomplished by reducing the transmission power, trimming the antenna, and/or moving the two nodes (HDL and CPP) further apart until the average observed packet error rate matched the first column of Table 1. The remaining columns show the fraction of all observed error runs (a consecutive sequence of erroneous packets was treated as a single sample in these statistics). As we can see, there are no long series of lost packets, which means that attempts at identifying runs (that would reduce the number of retransmission requests) will not be very successful.

Figure 11 shows the increase in the time of take transfer depending on the packet error rate under “random loss” conditions. The measured value is the ratio of the actual transfer time to the time with zero losses. The two sets of points correspond to the two request formats described in Section 3.6. At each point we also show the average number of rounds taken by each format (the upper number is format 1) for a 60-second (4000-block) take.

Note that for a low error rate (below 4%), format 1 turns out to be marginally better than format 2, a trend that reverses significantly at higher error rates. This is explained by the slightly less efficient representation of sparse missing blocks by format 2 under very low error rates (see Section 3.7).

Scenarios with short RF interferences, for example, corresponding to the situation when one HDL is sending blocks, while some others exchange status messages with their CPPs, are not visibly different from the random scenario, with the appropriately adjusted packet error rate. Consequently, it is more interesting to look at the cases of large losses caused by two or more concurrent take transfers. By adjusting the distance (cross-interference) between the different setups, and looking at the behavior of a single HDL-CPP pair, we can obtain error rates above 50% and reaching 90–95%. Note that even in extreme interference scenarios, the loss is never 100%, which is mostly due to the capture effect [33, 34]. What we see at those higher loss rates is longer error runs.

Experiments illustrating the performance of our system under such large loss rates were carried out using the setup shown in Figure 12. The two HDL-CPP pairs were constantly transmitting long takes in parallel. The separation between the two pairs was adjusted (between 0.5 and 10 m) to match the prescribed average error rate.

Figure 13 presents the observed distribution of runs for three different average packet error rates. The length of each bar tells the percentage of all erroneous (lost) packets that belonged to runs of the corresponding length (marked on the 𝑥-axis). Runs of length 1 are not shown (their percentage can be trivially deduced as the difference between 100 and the length of the first bar). In particular, for the packet error rate of 0.9, 50% of all lost packets were lost in runs of 17 or more.
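The two kinds of run statistics used in this section differ in what they count: Table 1 reports the fraction of error runs of each length, while Figure 13 reports the share of lost packets that fall into runs of each length. The following sketch computes both from a loss trace. The trace representation (a sequence of booleans, true meaning a lost packet) and the function names are assumptions for the illustration; the sample trace is made up and is not measured data.

```python
from collections import Counter
from itertools import groupby


def error_runs(lost):
    """Lengths of maximal runs of consecutive lost packets in a boolean trace."""
    return [sum(1 for _ in grp) for is_lost, grp in groupby(lost) if is_lost]


def run_fractions(lost):
    """Fraction of all error runs having each length (the view taken in Table 1)."""
    runs = error_runs(lost)
    if not runs:
        return {}
    return {k: c / len(runs) for k, c in sorted(Counter(runs).items())}


def lost_packet_shares(lost):
    """Share of lost packets belonging to runs of each length (the view of Figure 13)."""
    runs = error_runs(lost)
    if not runs:
        return {}
    total = sum(runs)
    return {k: k * c / total for k, c in sorted(Counter(runs).items())}


trace = [True, False, True, True, False, False, True, True, True, False, True]
print(run_fractions(trace))       # runs of length 1, 2, 3
print(lost_packet_shares(trace))
```

The second view weights each run by its length, which is why long runs dominate it at high error rates even when they are few in number.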

The actual performance of an HDL-CPP pair under heavy interference conditions is illustrated in Figures 14 and 15. These results were obtained by making one HDL of the setup in Figure 12 transmit a continuous sequence of samples (belonging to a dummy infinite take), while the other pair tried to exchange a sequence of forty 60-second takes. (The CPP paired with the interfering HDL was irrelevant and completely inactive.) For each transfer, we measured the total transmission time (until the arrival of the last missing sample) as well as the number of rounds needed to accomplish the transfer (the first transmission in response to the initial request was counted as round 1). Both measures were averaged over the 40 experiments. As before, the average transmission time per take is normalized assuming that a completely error-free transmission lasts one unit.

In terms of rounds, the superiority of the second request format is clear. Note that under enormously large error rates, a smaller number of rounds requires fewer request packets, which begins to positively feed back into the overall transmission time (the impact of losses among the request packets is smaller). Naturally, the actual transmission time of a take grows quite significantly with the increased interference level. Intuitively, with two independent pairs operating in parallel, one would be willing to put up with a twofold increase in the average take transmission time. Based on Figure 15, this happens around $P_e=0.4$, which was observed for the separation distance of about 5 m.

5. Enhancements and Generalizations

Although the solution discussed in this paper has been devised as part of a very practical project aimed at the development of a specific device catering to a specific application, it addresses a general problem that may be of relevance in other applications. In particular, we are currently building a wireless device for a profile matching application where the role of takes from the HDL is played by small pictures (photographs) exchanged by the nodes. The data exchange protocol in the new application has been copied verbatim from the HDL program.

5.1. Trading Reliability for Complexity

One of the characteristics of good “holistic” software solutions for microcontrolled devices is their breaking away from the layered paradigm of large-scale computing and networking. In consequence, some traditional concepts may assume different roles. This has happened to the concept of transmission window in our scheme. In contrast to traditional protocols, where the window counteracts the negative impact of the bandwidth-delay product, its primary purpose in our system is to reduce the channel reversal penalty (something that the traditional protocols have never worried about).

Our scheme can be extended to a scenario where the transmitted data has the appearance of a continuous stream, which can only be partially stored at the sending node. Of course, similar to other schemes [9, 28], it cannot possibly guarantee in such circumstances that all packets of the stream will always make it to the destination. But, as we argued elsewhere [35], all true streams must be prepared to accept occasional losses. With this allowance, it is easy to modify our solution to work with a circular buffer at the sender in a way that will automatically trade the buffer size for the perceived (effective) loss rate at the destination.

Let $M$ be the size of the circular buffer at the sender, that is, the buffer can accommodate up to $M$ most recent outgoing blocks of data. Let $K$ be the number of the first block stored in the buffer. The number of the last stored block is $K+M-1$. The process responsible for generating the blocks to be sent to the other party simply fills in the buffer in the standard way discarding the oldest samples in the natural FIFO fashion.

The modified scheme operates according to the original paradigm whereby rounds are triggered by single-packet requests specifying as many missing blocks as possible. Each of the outgoing block packets includes in its header the current value of $K$, to tell the recipient the lowest block number that it can still request. If a block whose number is less than $K$ is missing, the recipient knows that it has been irretrievably lost (it makes no sense to ask for that block). Otherwise, the protocol executes exactly as before.

The block numbers can be stored in a modular fashion, such that even huge streams can be accommodated without overtaxing the short packet format. In order to know which blocks to request, the recipient must maintain a conceptual equivalent of the circular buffer, which can be reduced to a bit map, if the received blocks need not be stored at the node.
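The circular-buffer extension can be illustrated with a short sketch. It models the sender's buffer of the most recent blocks and the recipient's decision about which missing blocks are still worth requesting. The class and function names are hypothetical, and block numbers are kept as plain unbounded integers here rather than in the modular form mentioned above.

```python
from collections import deque


class SenderBuffer:
    """Keeps the M most recent blocks; K is the number of the oldest one stored."""

    def __init__(self, m):
        self.blocks = deque(maxlen=m)   # appending to a full deque drops the oldest
        self.next_number = 0            # number that the next new block will get

    def add(self, payload):
        self.blocks.append(payload)
        self.next_number += 1

    @property
    def k(self):
        return self.next_number - len(self.blocks)

    def get(self, number):
        """Return the block with this number, or None if it has left the buffer."""
        if self.k <= number < self.next_number:
            return self.blocks[number - self.k]
        return None


def split_missing(missing, k):
    """Recipient side: separate requestable blocks from irretrievably lost ones."""
    requestable = sorted(b for b in missing if b >= k)
    lost = sorted(b for b in missing if b < k)
    return requestable, lost


buf = SenderBuffer(4)
for n in range(10):
    buf.add(f"b{n}")
print(buf.k)                            # oldest block still stored
print(split_missing({3, 7, 8}, buf.k))
```

A larger buffer lowers the value of K seen by the recipient at any moment, so fewer missing blocks end up in the lost list; this is the trade of buffer size for effective loss rate described above.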

5.2. Collision Avoidance

By consistently obeying a simple set of rules, multiple nodes of the application, including multiple CPPs, as well as multiple HDLs controlled by the same CPP, can effectively avoid collisions among simultaneous take transfers and make sure that status packets do not interfere with takes. This can be accomplished in a way that properly accounts for hidden terminals.

First, consider the case of a single CPP servicing multiple HDLs. In this setup, only one take transmission can be active at any given time, as each of them must be initiated by the CPP. Consider two HDL nodes 𝐴 and 𝐵. Suppose that 𝐴 is sending a round of blocks to the CPP. If 𝐵 is located within the transmission range of 𝐴, it will refrain from sending a status packet for as long as 𝐴 is transmitting. This is because status packets are sent with LBT enabled, which means that 𝐵 will listen to the medium for a short while before transmitting and postpone its transmission when it senses another activity. To make this work, one has to make sure that the LBT interval is longer than the tiny interpacket gap separating two packets in a round. This is analogous to SIFS/DIFS spacing in 802.11 [21, 22].

In a sense, the back-to-back series of packets sent by the HDL in one round can be viewed as a single unit of transmission. The situation is in fact much more favorable, because even if 𝐵 damages the first block of the round (LBT race), it will necessarily yield to the remaining packets. Thus, the first block packet sent in a round can be viewed as a bandwidth reservation request addressed to all HDL nodes in 𝐴's neighborhood. One may even consider putting a special (irrelevant) packet in front of a round batch, to make sure that LBT races never damage actual block packets.

Now suppose that 𝐵 is out of 𝐴's range, but within the range of the CPP, that is, it is a hidden terminal from the viewpoint of 𝐴. Then, 𝐵 has had an opportunity to see the CPP's request addressed to 𝐴. Note that that request specified (implicitly) the total number of blocks expected from 𝐴. Nothing stops 𝐵 from decoding that request (the same way it would decode a similar request addressed to itself) and estimating for how long 𝐴 will be transmitting the blocks. This estimate can be quite accurate, as round transmissions are highly deterministic. Consequently, 𝐵 will be able to hold back its traffic until the CPP has completely received the round.

In an environment with multiple CPPs, one has to avoid a situation when two CPPs simultaneously initiate take transmissions in a way that can make them interfere. Note that the HDLs involved in those transmissions can be allowed to interfere, as long as the reception at their CPPs is clear. Let us denote the two CPPs by 𝐶1 and 𝐶2, with 𝐻1 and 𝐻2 being their respective HDLs. The problem only occurs if 𝐻1 is within the range of 𝐶2 or 𝐻2 is in the range of 𝐶1. Suppose we have the first scenario (the other is obviously symmetric). Being within the range of 𝐶2, 𝐻1 will have an opportunity to hear 𝐶2's request addressed to 𝐻2. Consequently, it will not be responding to the requests of its CPP (𝐶1) for the amount of time inferred from the overheard request packet. This way, 𝐻1 will not start a round transmission for as long as 𝐻2 is handling the request of its CPP.

Notably, the nature of traffic in our application automatically takes care of the exposed terminal problem as well. The only rule to be added to the discussed scheme is that once an HDL decides that it is safe to transmit (according to the above rule), it will respond to a round request from its CPP blindly, that is, without employing LBT for the first packet. We mention this explicitly (even though we agreed that packets within a round are to be sent without LBT), because one might be tempted to send the first packet of a round with LBT enabled (in the spirit of treating the round as a logically single activity), but if the HDL knows of no foreign CPP requests in its neighborhood, it need not be polite to the neighbors with the transmission of its round. This is because any collision of that round will be purely local; both CPPs will be able to receive their blocks.

This discussion demonstrates that the interference experiments described in Section 4.2 do not pertain to scenarios expected to haunt real deployments, but were merely aimed at assessing the performance of our application under extreme stress. On top of the explicit collision avoidance techniques, there exist other ways to separate multiple CPP-HDL setups operating in the same area. First, one can assign different channels to those setups. Formally, the CC1100 RF module offers up to 256 different channels [11], whose separation can be improved by reducing their number. When using only 16 of those channels, that is, with 16 internal channels separating two adjacent application channels, the distance separation of 1 m between the two pairs resulted in the observed error rate below 2%. Creative power management is another option. In the present application, the CPP can ask its HDLs to adjust their transmitted power levels. Note, however, that by varying power levels used by different nodes, we spoil the approximate symmetry of neighborhood perception, which in turn affects the performance of reciprocity-based collision avoidance schemes.

5.3. Other RF Modules

Our choice of CC1100 for the project was dictated by its reasonable friendliness in terms of program interface as well as simplicity in terms of the built-in functionality. The module is a rather typical representative of its class. In particular, other variants of the same line by Texas Instruments, including CC2400, CC2420, and CC2430, offer essentially the same basic functionality augmented by on-chip MAC/routing features aimed at ZigBee compliance. Note that those modules operate in the 2.4 GHz band, which was considered unacceptable for the application for formal reasons. Even if the RF frequency was not an issue, the extra features of those modules would be useless for the application; consequently, they would essentially operate in a CC1100-compatible mode.

Note that the efficiency of take transmissions in our project hinges on exploiting the largest possible fraction of the raw bandwidth offered by the RF module. Thus, any advanced built-in features aimed at collision avoidance, bandwidth policing, and so on, would only get in the way. This property of our solution essentially puts all RF modules into the same basket.

In some of our other projects we have been using even simpler RF modules, for example, TR8100 from RF Monolithics [36], which, despite drastic differences in the interface, offers essentially the same functionality as CC1100. The primary difference is the absence of an on-chip LBT mechanism in TR8100, which, however, as we have verified elsewhere [8], can be effectively and efficiently emulated in software. Consequently, one should expect about the same results with other RF devices, as long as they offer similar transmission rates, which is the dominating factor affecting the overall performance of our scheme.

6. Conclusions

We have presented and analyzed a simple protocol for reliably transmitting file-like chunks of data between low-cost wireless devices. Our objective, dictated by the constraints of the application for which the protocol was specifically designed, was to maximize the bandwidth available to such transfers. The primary factor making the problem different from its classical formulation was the simplicity of the wireless channel whose raw capacity was tightly matched to the transmission bandwidth required by the application. By identifying and understanding the limitations of the platform, we were able to accomplish our goal and make the best use of its small (albeit sufficient) resources. Despite its stimulation by a very specific application, the problem appears to be quite general; our solution can be used to transmit reliably any file-like objects.

As a side effect of our presentation, we have demonstrated how good solutions in the embedded world can be arrived at by employing a holistic approach to the problems. Our data transmission scheme is a holistic derivative of some exotic properties of the RF module, flash memory, the limited amount of RAM, as well as the application-level demands. On the one hand, one may feel disappointed by this interference of the various apparently unrelated “features” into something that should rightfully belong to a well-established and separated “layer.” On the other hand, it is reassuring to see this much potential for creativity in an otherwise routine project. This potential is what makes the realm of embedded systems challenging in its own highly attractive sort of way.