Title Page
Abstract
Contents
Chapter 1. Introduction 14
1.1. Multi-bit Flip-flop Methodology 14
1.2. State Retention Storage Allocation on Power Gated Circuit 18
1.3. Contributions of this Dissertation 23
Chapter 2. Enhancing Design Qualities Utilizing Multi-bit Flip-flops: A Design and Technology Co-optimization Driven Approach 26
2.1. Key Observations and Enabling Optimization Directions 26
1. Utilizing the full flexibility of D-to-Q flows in MBFFs to save route cost 26
2. Utilizing the unused space in MBFF footprint to resolve the timing problem 27
2.2. DTCO Framework for Multi-bit Flip-flops 31
2.2.1. The Proposed DTCO Flow 31
2.2.2. D-to-Q Flow Optimization 31
2.2.3. Timing-driven D-to-Q Flow Refinement 40
2.2.4. Timing Optimization at Post-Route Stage 44
2.3. Experimental Results 52
2.3.1. Experimental Setup 52
2.3.2. Comparing MBFF-opt with Conventional MBFF Allocation 55
2.3.3. Comparing MBFF-opt with Conventional No-Banking Flow 60
2.3.4. Runtime Analysis of MBFF-opt 60
2.3.5. Comparing MBFF-opt with Conventional No-Banking flow with more timing-optimized MBFF banking design 60
Chapter 3. Minimally Allocating Always-on State Retention Storage for Supporting Power Gating Circuits 64
3.1. Motivations 64
3.2. Optimal MBRFF Allocation Algorithm for l=2 66
3.2.1. Transforming Flip-flop Dependency Graph 68
3.2.2. Minimal-cost Covering for the Transformed Graph 71
3.2.3. Allocating MBRFFs According to Minimal-cost Covering 73
3.3. Extending Optimality of MBRFF-opt for l=3 74
3.3.1. Extending Node Replication and Edge Updating 74
3.4. Experimental Results 77
3.4.1. Minimizing Total Number of Bits of Retention Storage 82
3.4.2. Minimizing Total Leakage Power on Retention Storage 83
3.4.3. Minimizing Total Area of Retention Storage 83
Chapter 4. Further Consideration 84
4.1. Multi-bit Flip-flops in Power Gated Circuits 84
Chapter 5. Conclusions 88
5.1. Chapter 2 88
5.2. Chapter 3 89
Bibliography 90
초록 98
Table 2.1. Timing (setup time + clock-to-Q delay) on the flip-flops f₁, f₂, f₃, and f₄ in Fig. 2.2 as the transistor upsizing (i.e., folding) to level-1, level-2, and level-3 is... 30
Table 2.2. Wirelength reduction between Di and tiD (△WLD), and Qi and tiQj (△WLQj) for all cases in Fig. 2.4. (isRevDir() is described in Algorithm. 1.)[이미지참조] 41
Table 2.3. Notations used in our ILP formulation 48
Table 2.4. IWLS benchmark circuits used for the experiments. 55
Table 2.5. PPA comparison of the implementations produced by Conv. MBFF and our MBFF-opt. 56
Table 2.6. The number of MBFF cell instances replaced by our MBFF-opt. 58
Table 2.7. PPA comparison of the implementations produced by Conv. No-banking, Conv. MBFF, and our MBFF-opt. The unit of Conv. MBFF and MBFF-opt is percentage and the blue-colored numbers indicate improvement in comparison with that in... 59
Table 2.8. Runtime of Steps 1, 2, and 3 in MBFF-opt. 61
Table 2.9. PPA comparison of the implementations produced by Conv. No-banking, Conv. MBFF, and our MBFF-opt using high timing effort to merge and split MBFF. The unit of Conv. MBFF and MBFF-opt is percentage and the blue-colored numbers... 63
Table 3.1. Leakage power on the always-on retention storage in k-bit MBRFFs in Synopsys 32nm generic library and Chen. 67
Table 3.2. Area of k-bit MBRFFs in Synopsys 32nm generic library and Chen. (k=0 indicates flip-flop with no retention storage.) 67
Table 3.3. IWLS benchmark circuits 78
Table 3.4. Comparisons of the effectiveness of the existing best-known MBRFF allocation algorithm (Fan [2]) and our MBRFF-opt in total bits of retention storage of view. ("*" indicates the circuit is partitioned to a manageable size.) 79
Table 3.5. Comparisons of the effectiveness of the existing best-known MBRFF allocation algorithm (Fan [2]) and our MBRFF-opt in total leakage power on retention storage of view. ("*" indicates the circuit is partitioned to a manageable size.) 80
Table 3.6. Comparisons of the effectiveness of the existing best-known MBRFF allocation algorithm (Fan [2]) and our MBRFF-opt in total implementation area of retention storage of view. ("*" indicates the circuit is partitioned to a manageable size.) 81
Table 3.7. Problem sizes before and after partitioning flip-flop dependency graphs for large circuits. 82
Figure 1.1. (a) Structure of two 1-bit flip-flops. (b) Structure of 2-bit MBFF merging the two 1-bit flip-flops in (a). 15
Figure 1.2. Comparison of clock tree structure of circuit using (a) 1-bit flip-flops and (b) 4-bit multi-bit flip-flops. 16
Figure 1.3. Examples illustrating two physical limitations that hinder an extensive use of MBFF cells. (a) Non-flexible cell flipping. (b) Space waste in cell layout. 18
Figure 1.4. The internal structure of k-bit MBRFF in [3]. The blue and red translucent lines indicate the flow of state saving and restoration, respectively. 20
Figure 1.5. Illustration of cycle-by-cycle state restoration for the MBRFF allocation. Initially, at time t₀ flip-flip f₁ retains 3 states and f₃ 2 states. During the following three... 20
Figure 1.6. Comparison of uniform and non-uniform MBRFF allocations in terms of total retention bits and control network overhead. 21
Figure 2.1. Effectiveness of cell flipping on reducing route cost. (a) Flipping a single-bit flip-flop: Fully effective. (b) Flipping an MBFF: Partially or little effective. (c)... 28
Figure 2.2. Utilization of the empty space induced by the clock inverter sharing in MBFFs. (a) A CMOS circuit of DFFHQNx1 cell in ASAP7nm constructed from two CMOS latches. (b) Layout of DFFHQNx1 cell in ASAP7nm. (c) An MBFF composed of... 29
Figure 2.3. Our proposed DTCO flow MBFF-opt for synthesizing and utilizing MBFFs that are most suitable for target designs. The objective in Step 1 is to improve chip routability and Step 2 is to improve timing in global routing stage as well as chip routability... 32
Figure 2.4. Analysis of x-coordinate relation between Di, Qi, tiD, and tiQ where I(A,B) represents the interval between A and B in x-coordinate and x(A) denotes...[이미지참조] 35
Figure 2.5. Illustration of covering every D-to-Q flow combination of 4-bit MBFF by using 6 cells. 39
Figure 2.6. Proposed flow of timing-driven D-to-Q flow refinement by MBFF replacement. 43
Figure 2.7. Example of finding an optimal flip-flop reordering. (a) An MBFF instance with timing violation on the route path (red color) to pin D2. (b) Mapping function... 46
Figure 2.8. Timing elaboration flow 53
Figure 2.9. Three different flows of placement and routing conducted in our experiments. (a) Conv. No-banking: Conventional flow with no use of MBFFs. (b) Conv... 54
Figure 2.10. Comparison of the distribution of DRVs (white crosses) and timing violation registers (red rectangles) on the implementations of circuit USB FUNCT produced... 58
Figure 2.11. Changes of ILP runtime as the number of 4-bit MBFF instances changes in Step 3. 61
Figure 3.1. Example illustrating the effect of cycle-cut on allocation quality. 65
Figure 3.2. Comparison of retention storage reduction produced by [2] on IWLS [5] circuits when the wakeup latency constraint l is set from 1 to 2, 1 to 3, and 1 to 4 clock cycles. 66
Figure 3.3. Example of transforming an original flip-flop dependency graph to a feasible covering graph for MBRFF allocation. 69
Figure 3.4. An illustration of node replication, edge updating, and segment generation to maintain consistency, i.e., Fact 3, with the flip-flop state restoration process in G. 72
Figure 3.5. (a) Construction of constraint matrix and UCP solution. (b) MBRFF allocation for the solution in (a). 73
Figure 3.6. An illustration of node replication, edge updating, and segment generation to maintain consistency, i.e., Fact 3 (extended), with the flip-flop state restoration pro-... 76
Figure 3.7. (a) Construction of constraint matrix and UCP solution. (b) MBRFF allocation for the solution in (a). 77
Figure 3.8. MBRFF distribution of the allocation results for MEM CTRL in Table 3.4 and 3.6 where the green, yellow, and red small rectangles indicate 1-bit, 2-bit, and 3-bit... 78
Figure 4.1. Retention flip-flop structure used in power gated circuit. 85
Figure 4.2. Illustration of state retention storage allocation on a flip-flop dependency graph. (a) Total of 3 bits with latency of two clock cycles using non-uniform retention... 86