# Methodology for Energy-Efficient Design of Digital Circuits

Vojin G. Oklobdzija

Advanced Computer Systems Engineering Laboratory, University of California / University of Texas TxACE: Center of Excellence for Analog Circuits, http://www.acsel-lab.com

> IEEE SSCS, Distinguished Lecture Santa Clara Valley Chapter April 15, 2010


\*With acknowledgment to the contributions of my former students: Dr. Hoang Dao, AMD, Austin, TX, and Dr. Bart Zeydel, Plato Networks

## Summary of the Presentation

Energy Efficiency of Digital CMOS Circuits

- Problems
- Energy-Delay Relationship
- Minimizing energy for a given delay
- Methodology
- Determining the best structures for a high-performance system
- Implications on the architecture

## **Challenges in High-Performance Design**

- Optimizing for power not speed! (or maximizing speed under the power budget)
- Logical Effort (LE) optimizes for speed regardless of power, i.e., it leads us to the worst energy point.
- Our method optimizes for: *power @ given speed* or *speed @ given power*
- We are developing new approaches for power efficiency (overlooked by delay optimization) applicable to:
  - Circuit structures
  - Design techniques
  - Energy-Delay Space
  - Creation of optimal Standard cell (ASIC) libraries

## **Motivation for Energy Efficient Design**

Shekhar Borkar

### **Power density will increase**



- Power density has passed the level found in a nuclear reactor!
- Power density degrades the reliability and speed.
- 4 April 19, 2010

## **Power Density: The Future**



With high power density, cannot assume uniformity

- As die temperature increases, CMOS logic slows down
- At high die temp., long-term reliability can be compromised



# **Energy-Delay Relationship**


Energy-Efficient CMOS Circuit Design




- Must look at Energy-Delay Space of designs

**Energy-Delay Space View** 






Begin to see characteristics of designs







Best High-Performance designs are clearly seen

- Different from what would be chosen from a single design point



Also determines best design for Low-Power Target

### Contribution of Wire to Delay and Energy should be examined too



• Without wire, differences appear large


### Contribution of Wire to Delay and Energy should be examined too



• Wire strongly impacts selection of "best adders"





# Where does Logical Effort lead us?



• Is it possible to lower energy by trading delay? Or ...




#### $(E-E_0)(D-D_0)=0.2\times E_0D_0$

\*P. Penzes, Caltech, PhD 2002, V. Zyuban, ISLPED 2002
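To get a feel for how steep this hyperbola is, the relation can be solved for energy as a function of delay. The sketch below treats $E_0$ and $D_0$ purely as the curve's two parameters (the slide does not define them further); the function name and the normalized values in the usage note are illustrative assumptions.

```python
def energy_on_curve(delay, e0, d0, k=0.2):
    """Solve (E - e0) * (D - d0) = k * e0 * d0 for E at a given delay D.

    e0, d0 are the parameters of the hyperbola from the slide; k = 0.2 is
    the constant quoted there. Only delays above d0 lie on this branch.
    """
    if delay <= d0:
        raise ValueError("delay must exceed d0 on this branch of the curve")
    return e0 + k * e0 * d0 / (delay - d0)


# Normalized example (e0 = d0 = 1): energy at a few delay points.
curve = {d: energy_on_curve(d, 1.0, 1.0) for d in (1.1, 1.2, 1.5, 2.0)}
```

With these normalized parameters, moving from `D = 1.2` to `D = 2.0` takes the energy from about `2.0` to about `1.2`, which is the kind of steep trade the following slides exploit.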





### **Exhaustive search**

A circuit with 10 transistors and 10 possible sizes for each transistor requires checking $10^{10}$ possible solutions!



C. Giacomotto, N. Nedovic, and V. G. Oklobdzija, "*The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements*", IEEE JoSSC, vol. 42, no. 6, June 2007.






• Design choice depends on (E,D) requirements

### **Prior Work on Design Optimization**

- Transistor-based [TILOS]
  - Sizing individual transistors
  - Growing complexity
  - Applicable to small blocks only
- Block-based [Zyuban & Strenski]
  - Blocks: latch & logic
  - Trading {energy, delay} of blocks
  - CAD tools
  - Fixed interface

### **Transistor-Based** Approach

- Optimization problem: (i=1..M)
  - Minimize: Area( $W_1, ..., W_M$ )  $\approx \Sigma W_i$
  - Constraint:  $D_{worst}(W_1,...,W_M) = T$
- Delay modeling
  - Linear (RC-like): TILOS
  - Look-up table: AMPS (Synopsys)
- A convex problem  $\Rightarrow$  minimal solution exists
  - Different polynomial algorithms developed
  - May have issues with convergence
  - Long run time with increasing design complexity
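The formulation above (minimize total width subject to a delay target) can be illustrated with a greedy, sensitivity-driven loop in the spirit of TILOS. This is only a sketch under stated assumptions, not the TILOS implementation: the circuit is a hypothetical three-stage chain with a linear delay model, `C_IN`, `C_LOAD` and `BUMP` are made-up values, and the stopping rule is `delay <= target` rather than strict equality.

```python
C_IN, C_LOAD = 1.0, 64.0   # hypothetical fixed input size and output load
BUMP = 1.1                 # assumed multiplicative upsizing step


def delay(widths):
    """Linear (RC-like) delay: each stage contributes load / drive."""
    caps = [C_IN] + list(widths) + [C_LOAD]
    return sum(caps[i + 1] / caps[i] for i in range(len(caps) - 1))


def area(widths):
    """Area approximated by the sum of the sizable widths."""
    return sum(widths)


def greedy_size(target, n_gates=2, max_steps=10_000):
    """Start at minimum size; repeatedly upsize the gate that buys the
    largest delay reduction per unit of added area."""
    widths = [1.0] * n_gates
    for _ in range(max_steps):
        current = delay(widths)
        if current <= target:
            break
        best, best_gain = None, 0.0
        for i in range(n_gates):
            trial = widths.copy()
            trial[i] *= BUMP
            gain = (current - delay(trial)) / (area(trial) - area(widths))
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # no single upsizing step helps any more
        widths[best] *= BUMP
    return widths


sized = greedy_size(target=14.0)
```

The run time of such a loop grows with the number of sizable devices and the number of bump steps, which is the scaling concern the slide raises for transistor-level sizing.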

### Block-Based: Zyuban (IBM)



### **Application:** Solution Verification

![](_page_24_Figure_1.jpeg)

[Zyuban, IBM T.J. Watson Research]

- Verify optimality of solution:
  - Block 1: $(w_1/u_1) \cdot \eta_1 = 2.0$
  - Block 2: $(w_2/u_2) \cdot \eta_2 = 2.0$
  - Equal $\Rightarrow$ optimal!

### **Application:** Solution Verification

![](_page_25_Figure_1.jpeg)

- If $\eta_1 = 3.2$ and $\eta_2 = 0.8$:
  - Block 1: $(w_1/u_1) \cdot \eta_1 = 8.0$
  - Block 2: $(w_2/u_2) \cdot \eta_2 = 0.5$
  - Unequal $\Rightarrow$ not optimal
- Better solution? Relax $\eta_1$ and increase $\eta_2$
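The balance check on these two slides is simple arithmetic and can be written down directly. In the sketch below the ratios $w_1/u_1 = 2.5$ and $w_2/u_2 = 0.625$ are back-computed from the slide's numbers ($8.0/3.2$ and $0.5/0.8$) and assumed to stay fixed between the two cases; the function names are illustrative.

```python
import math

# (w/u) ratios inferred from the slide values; assumed constant for both cases.
RATIO_BLOCK1 = 8.0 / 3.2   # 2.5
RATIO_BLOCK2 = 0.5 / 0.8   # 0.625


def weighted_intensities(etas):
    """Return (w_i / u_i) * eta_i for the two blocks."""
    return [RATIO_BLOCK1 * etas[0], RATIO_BLOCK2 * etas[1]]


def is_balanced(etas, rel_tol=1e-9):
    """Optimality test from the slide: the weighted values must be equal."""
    first, second = weighted_intensities(etas)
    return math.isclose(first, second, rel_tol=rel_tol)


unbalanced = is_balanced((3.2, 0.8))   # the 8.0 vs. 0.5 case
balanced = is_balanced((0.8, 3.2))     # both blocks evaluate to about 2.0
```

Under that fixed-ratio assumption, relaxing $\eta_1$ from 3.2 to 0.8 and raising $\eta_2$ from 0.8 to 3.2 reproduces the balanced value of 2.0 for both blocks.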

## **Major Limitation**

![](_page_26_Figure_1.jpeg)

- Zyuban's assumption:
  - Delay & energy independence of each block B<sub>i</sub>
- Single path: block = gate
  - Delay: $T_d \propto \{C_{out}, C_{in}\}$
  - Energy: $E \propto \{C_{out}, C_{in}\}$
  - $C_{in}$: current gate cap; $C_{out}$: next gate cap
  - Hence the energy and delay of adjacent gates depend on each other

- Similar dependency between blocks and pipelines
- No analytical solution once this dependency is taken into account

# Proposed Stage-Based Approach

![](_page_27_Figure_1.jpeg)

- Stage ≈ logic depth
- Gates  $\rightarrow$  stage
  - Based on maximal distances to input and output
  - Stage delay:  $d_{stage} = max\{d_{gate}\}$
  - Stage energy:  $E_{stage} = \Sigma E_{gate}$
  - Estimated from gate energy & delay models
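The grouping rule above can be sketched in a few lines. This example is an illustration under assumptions: the netlist, its delay and energy numbers are made up, and gates are assigned to stages by their maximal distance from the inputs only (the slide also uses the distance to the output, which is omitted here for brevity).

```python
from collections import defaultdict

# Hypothetical netlist: gate -> (fan-in gates, gate delay, gate energy).
NETLIST = {
    "g1": ([], 1.0, 2.0),
    "g2": ([], 1.5, 1.0),
    "g3": (["g1", "g2"], 2.0, 3.0),
    "g4": (["g2"], 1.2, 1.5),
    "g5": (["g3", "g4"], 0.8, 2.5),
}


def logic_depth(netlist):
    """Maximal distance of each gate from the primary inputs (1-based)."""
    depth = {}

    def visit(gate):
        if gate not in depth:
            fanin = netlist[gate][0]
            depth[gate] = 1 + max((visit(f) for f in fanin), default=0)
        return depth[gate]

    for gate in netlist:
        visit(gate)
    return depth


def stage_summary(netlist):
    """Per stage: (max gate delay, sum of gate energies)."""
    stages = defaultdict(list)
    for gate, level in logic_depth(netlist).items():
        stages[level].append(gate)
    return {
        level: (
            max(netlist[g][1] for g in gates),
            sum(netlist[g][2] for g in gates),
        )
        for level, gates in sorted(stages.items())
    }


summary = stage_summary(NETLIST)
total_delay = sum(d for d, _ in summary.values())
total_energy = sum(e for _, e in summary.values())
```

Summing the per-stage values gives the path-level delay and energy that the stage-based optimization works with.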

## Delay and Energy Modeling of Gates

![](_page_28_Figure_1.jpeg)

# **Pipelined Stage Optimization**

# **Stage-Based Optimization**

![](_page_30_Figure_1.jpeg)

#### Optimization functions

- Delay:  $D = \Sigma D_{\text{Stage(i)}}$
- Energy:  $E = \Sigma E_{\text{Stage(i)}}$

#### Possible design constraints

- Delay target, D
- Input size, W<sub>input</sub>
- Output load, C<sub>load</sub>
- Posynomial Problems
  - Solvable with polynomial algorithms

# **Problem A: Delay Optimization**

![](_page_31_Figure_1.jpeg)

- Optimization problem
  - Minimize:  $D = \Sigma D_{Si}$
  - Constraint: {Input, Load} = const.
- Objectives
  - Obtain minimally achievable delay, D<sub>min</sub>
  - Wanted in performance-critical designs
  - Disregard energy consumption (*actually*,  $\partial E_i / \partial D_i = \infty$ )

### Single Path: Logical Effort

![](_page_32_Figure_1.jpeg)

Solution = equal stage effort f (i.e. fan-out)
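A small numerical check of the equal-stage-effort result, assuming the standard Logical Effort delay model (stage delay = g · C_next / C_this + p) and a hypothetical inverter / NAND2 / inverter path. The g and p values are the usual textbook ones; the input size and load are made up for illustration.

```python
import math


def equal_effort_sizes(c_in, c_load, g):
    """Size a single path so every stage has the same effort f."""
    n = len(g)
    f = (math.prod(g) * c_load / c_in) ** (1.0 / n)
    sizes = [c_in]
    for g_i in g[:-1]:
        # f = g_i * C_next / C_this  =>  C_next = C_this * f / g_i
        sizes.append(sizes[-1] * f / g_i)
    return f, sizes


def path_delay(sizes, c_load, g, p):
    """Sum of stage delays g_i * C_(i+1) / C_i + p_i along the path."""
    caps = list(sizes) + [c_load]
    return sum(g[i] * caps[i + 1] / caps[i] + p[i] for i in range(len(g)))


G = [1.0, 4.0 / 3.0, 1.0]   # inverter, NAND2, inverter
P = [1.0, 2.0, 1.0]         # parasitic delays
f_opt, sizes_opt = equal_effort_sizes(1.0, 36.0, G)
d_opt = path_delay(sizes_opt, 36.0, G, P)

# Upsizing the middle gate by 20% breaks the balance and costs delay.
perturbed = [sizes_opt[0], sizes_opt[1] * 1.2, sizes_opt[2]]
d_perturbed = path_delay(perturbed, 36.0, G, P)
```

Here `d_perturbed` comes out larger than `d_opt`, which is the sense in which equal stage effort is the delay-optimal sizing for a single path.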

### Energy Cost vs. Total Delay

![](_page_33_Figure_1.jpeg)


### **Multi-Path Circuits**

![](_page_34_Figure_1.jpeg)

Optimal delay depends on off-path load


## Linear Branching

![](_page_35_Figure_1.jpeg)

- Linear branching: $C_{off,i} / C_i = const.$
- Setting $dD/dC_i = 0 \ (\forall i = 1..N)$ gives equal stage efforts:

$$f_i = f_{opt} = \left[ \left( \prod_{i=1}^N g_i \right) \left( \prod_{i=1}^N b_i \right) \left( \frac{C_{Load}}{C_{in}} \right) \right]^{1/N}$$

$$D_{min} = N f_{opt} + \sum_{i=1}^N p_i$$

- Branching factor: $B = \prod_{i=1}^N b_i$
- Similar analytical form for solution
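The closed form for linear branching can be evaluated directly. The sketch below takes the N-th root of the path effort, as in the standard Logical Effort result, and uses hypothetical values for the logical efforts `g`, branching efforts `b`, parasitics `p`, input size and load.

```python
import math


def optimal_stage_effort(g, b, c_load, c_in):
    """f_opt = (prod(g) * prod(b) * C_load / C_in) ** (1 / N)."""
    return (math.prod(g) * math.prod(b) * c_load / c_in) ** (1.0 / len(g))


def min_delay(g, b, p, c_load, c_in):
    """D_min = N * f_opt + sum(p)."""
    return len(g) * optimal_stage_effort(g, b, c_load, c_in) + sum(p)


def on_path_sizes(g, b, c_load, c_in):
    """Input capacitance of each on-path gate at the optimum."""
    f = optimal_stage_effort(g, b, c_load, c_in)
    sizes = [c_in]
    for g_i, b_i in zip(g[:-1], b[:-1]):
        # f = g_i * b_i * C_next / C_this  =>  C_next = C_this * f / (g_i * b_i)
        sizes.append(sizes[-1] * f / (g_i * b_i))
    return sizes


# Hypothetical three-stage inverter path with branching at stages 1 and 3.
g = [1.0, 1.0, 1.0]
b = [2.0, 1.0, 4.0]
p = [1.0, 1.0, 1.0]
f_opt = optimal_stage_effort(g, b, c_load=8.0, c_in=1.0)
d_min = min_delay(g, b, p, c_load=8.0, c_in=1.0)
sizes = on_path_sizes(g, b, c_load=8.0, c_in=1.0)
```

With these numbers the path effort is 1 · 8 · 8 = 64, so the stage effort is close to 4 and the on-path sizes come out near 1, 2 and 8.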
### Non-Linear Branching



- Nonlinearity due to:
  - Constant off-path load (wire cap, min-size gates)
  - Unequal path lengths
  - Parasitic delay difference of gates

#### No analytical form

- Recursive solving
- Solution: unequal stage efforts
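A minimal sketch of how a constant off-path load leads to unequal stage efforts and an iterative solution. It assumes a hypothetical three-inverter chain (g = 1, parasitics ignored) with a fixed wire capacitance `c_wire` on the output of the second stage; the update rules come from setting the partial derivatives of the delay to zero and iterating them to a fixed point.

```python
import math


def chain_delay(c_in, c2, c3, c_load, c_wire):
    """Delay of the chain with a constant wire load on stage 2's output."""
    return c2 / c_in + (c3 + c_wire) / c2 + c_load / c3


def solve_sizes(c_in, c_load, c_wire, iters=200):
    """Iterate the stationarity conditions dD/dc2 = 0 and dD/dc3 = 0."""
    c2, c3 = c_in, c_in
    for _ in range(iters):
        c2 = math.sqrt(c_in * (c3 + c_wire))   # from 1/c_in = (c3 + c_wire)/c2^2
        c3 = math.sqrt(c2 * c_load)            # from 1/c2 = c_load/c3^2
    return c2, c3


def stage_efforts(c_in, c2, c3, c_load, c_wire):
    """Stage effort taken as total load over input capacitance."""
    return (c2 / c_in, (c3 + c_wire) / c2, c_load / c3)


C_IN, C_LOAD, C_WIRE = 1.0, 64.0, 10.0   # made-up values
c2_opt, c3_opt = solve_sizes(C_IN, C_LOAD, C_WIRE)
efforts = stage_efforts(C_IN, c2_opt, c3_opt, C_LOAD, C_WIRE)
d_opt = chain_delay(C_IN, c2_opt, c3_opt, C_LOAD, C_WIRE)

# Equal-effort sizing that ignores the wire (c2 = 4, c3 = 16) for comparison.
d_ignoring_wire = chain_delay(C_IN, 4.0, 16.0, C_LOAD, C_WIRE)
```

In this toy case the first two stage efforts match while the last one is smaller, and the iterated sizing is faster than the equal-effort sizing that ignores the wire.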

### 64-bit Static Kogge-Stone Adder



### 64-b KS: Stage Effort Distribution



Nonlinearity causes unequal stage efforts

 Nonlinear factors: wire load, parasitic delay diff.

### 64-b KS: Energy versus Delay



Significant wire effect on delay & energy


# Problem B: Energy-Delay Tradeoff



- Optimization Problem
  - Minimize:  $E = \Sigma E_{\text{Stage(i)}}$
  - Constraints: {Input, Load} = const., $\Sigma D_{\text{Stage}(i)} = D_{min} + \Delta D$

- Objectives
  - Avoid infinite energy sensitivity at D<sub>min</sub>
  - Equalize energy-delay sens.:  $(\partial E/\partial D)_{Si} = (\partial E/\partial D)_{Sk}$
  - Trade delay for energy (traditional approach)
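The equal-sensitivity condition can be reproduced on a toy circuit. The sketch below assumes a hypothetical three-inverter chain with fixed input and load, energy proportional to the internal switched capacitance `c2 + c3`, and a Lagrangian formulation (minimize E + λ·D, then bisect on λ to hit the delay target). These modeling choices are assumptions for illustration, not the method used for the adder results in the slides.

```python
import math

C_IN, C_LOAD = 1.0, 64.0   # hypothetical fixed input size and output load


def delay(c2, c3):
    return c2 / C_IN + c3 / c2 + C_LOAD / c3


def energy(c2, c3):
    return c2 + c3


def minimize_lagrangian(lam, iters=500):
    """Coordinate-wise minimization of E + lam * D over (c2, c3)."""
    c2, c3 = 4.0, 16.0
    for _ in range(iters):
        c2 = math.sqrt(lam * c3 / (1.0 + lam / C_IN))
        c3 = math.sqrt(lam * C_LOAD / (1.0 + lam / c2))
    return c2, c3


def size_for_delay(d_target, steps=200):
    """Bisect (geometrically) on lam until the delay target is met."""
    lo, hi = 1e-3, 1e9
    for _ in range(steps):
        lam = math.sqrt(lo * hi)
        if delay(*minimize_lagrangian(lam)) > d_target:
            lo = lam   # too slow: put more weight on delay
        else:
            hi = lam
    return minimize_lagrangian(hi)


D_MIN = delay(4.0, 16.0)            # equal-effort sizing, f = 4
c2, c3 = size_for_delay(1.1 * D_MIN)
saving = 1.0 - energy(c2, c3) / energy(4.0, 16.0)

# Energy-delay sensitivity of each sizable stage: (dE/dc) / (dD/dc).
sens_2 = 1.0 / (1.0 / C_IN - c3 / c2**2)
sens_3 = 1.0 / (1.0 / c2 - C_LOAD / c3**2)
```

At the returned sizing `sens_2` and `sens_3` agree, which is the equalized energy-delay sensitivity stated in the objectives, and `saving` is the fraction of internal switching energy recovered for 10% extra delay in this toy model.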

### E-D Trade-off: Single Path



### E-D Trade-off: Stage Effort Distrib.






### 64-b KS: Energy Delay Tradeoff



- D<sub>min</sub> solution is very inefficient in energy
- 55% energy saving with 10% delay traded
- Solution = equal stage energy-delay sensitivity

Figure labels: LE, Decrease Stage Effort, Increase Stage Effort.

- If this is your design point, the drop is steep! But you should not be designing there!
- For a small sacrifice in delay, the energy savings are big!

#### From:

R.W. Brodersen, M.A. Horowitz, D. Markovic, B. Nikolic, V. Stojanovic, "Methods for True Power Minimization," International Conference on Computer-Aided Design, ICCAD-2002, Digest of Technical Papers, San Jose, CA, November 10-14, 2002, pp. 35-42.


# **Problem C: Energy Minimization**



- Problem:
  - Minimize: $E = \Sigma E_{Si}$
  - Constraints: $D = \Sigma D_{Si} = const.$, Load = const.

- Objective:
  - Obtain absolute minimal energy @ given delay
  - Equalize energy-delay sens.:  $(\partial E/\partial D)_{Si} = (\partial E/\partial D)_{Sk}$
  - Trade input size such that  $(\partial E / \partial Input)_D = 0$
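To illustrate trading input size at a fixed delay, the sketch below uses a hypothetical two-inverter chain in which the input size `c1` is free. It assumes, for illustration only, that energy is proportional to the total switched gate capacitance `c1 + c2` (so a larger input is not free) and finds the minimum with a simple ternary search along the constant-delay curve.

```python
import math

C_LOAD = 64.0
D_TARGET = 16.0   # the minimum delay of this chain when the input size is 1


def input_size(c2):
    """Input size c1 that meets the delay target exactly:
    c2 / c1 + C_LOAD / c2 = D_TARGET."""
    return c2 / (D_TARGET - C_LOAD / c2)


def energy(c2):
    """Assumed energy model: total switched gate capacitance."""
    return input_size(c2) + c2


def ternary_min(f, lo, hi, iters=200):
    """Minimize a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


# c2 must exceed C_LOAD / D_TARGET for the delay target to be reachable.
c2_opt = ternary_min(energy, C_LOAD / D_TARGET + 1e-6, C_LOAD)
c1_opt = input_size(c2_opt)
e_opt = energy(c2_opt)
e_fixed_input = energy(8.0)   # c1 = 1, c2 = 8: the delay-optimal sizing
```

In this toy model the minimum-energy point uses a larger input than the delay-optimal sizing (`c1_opt > 1`) and less total energy at the same delay, which mirrors the direction of the trade described in the objectives.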

#### Single Path: Stage Effort Redistrib.






### 64-b KS: Minimal Energy vs. Delay



- 30–50% energy saving @ same performance
- 1.6–3.6X input size

# **Pipelined System Optimization**

### **Pipelined System Optimization**

- Design constraints
  - Delay target
  - External I/O constraint
- How to obtain minimal-energy solution?
  - Pipelined stages
    - Minimized for energy
    - Sensitive to input and load variations
  - System level
    - Balancing energy sensitivities at pipelined boundaries
  - Recursive process

### Pipelined Stage: Efficient E-D Area



### **Efficient Input-Delay Area**



Larger/smaller input $\Leftrightarrow$ less/more energy

### Efficient Energy-Input @ D = const.



#### Sensitivity to Load @ D = const.



### System Energy Optimization



- Energy minimization
  - Pipeline: minimize energy @ given input & load
  - System: balance energy sensitivities @ boundaries
- Trading elements: input size, output load
- Optimal criteria:

$$E_{StageA_{i}} = minimal$$

$$\sigma_{E,InputA_{i}} = \sigma_{E,LoadA_{i-1}}$$

### How to Achieve Less Total Energy



 $E_{i-1} + E_i = min \iff \sigma_{E,Input(i)} = \sigma_{E,Load(i-1)}$ 
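One way to see why this criterion holds, assuming the input of stage $i$ and the load of stage $i-1$ are the same shared boundary capacitance $C_b$ (a symbol introduced here for illustration) and that $\sigma$ denotes the magnitude of the energy sensitivity at fixed delay (the slide does not define $\sigma$ explicitly):

$$\frac{d}{dC_b}\left(E_{i-1} + E_i\right) = \frac{\partial E_{i-1}}{\partial C_b} + \frac{\partial E_i}{\partial C_b} = 0 \quad\Rightarrow\quad \frac{\partial E_{i-1}}{\partial C_b} = -\frac{\partial E_i}{\partial C_b}$$

A larger load raises $E_{i-1}$, while a larger input lowers $E_i$ in the efficient region, so the two derivatives have opposite signs and equal magnitudes at the minimum, i.e. $\sigma_{E,Load(i-1)} = \sigma_{E,Input(i)}$. This is a stationarity condition; it identifies the minimum when the total energy is convex in $C_b$.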

### Case Study: Media Datapath



### Case Study: Optimal Criteria



### Case Study: Optimal Algorithm



#### Case Study: Media Datapath Solution



System Energy-Delay @ V<sub>DD</sub>=const.



Similar E-D characteristics as single stages

• Possibly smaller input size @ lower delay


### Effect of Supply Scaling



### Potential Saving of System Energy



 Significant energy saving with correct supply or delay selection

### Energy-Delay Improvement of Pipelined Stages

## Architectural Advantages



1/N clock rate

Same clock rate

# **Energy-Throughput Comparison**



Pipelining is mostly more efficient in E-D domain!


## ACS Unit Implementation



## Energy-Throughput Comparison



Less energy for deeper pipeline at given throughput

# Designing a System for a Fixed Performance in the Energy-Delay Space


### **Pipelining and Parallelism for an ACS Circuit**



## **Energy-Delay Results for Parallelism**




## **Energy-Delay Results for Pipelining**




## Energy-Delay Estimation Matches Complex Circuit Simulation




#### Energy Efficiency of Architectural Choices (including supply scaling and circuit sizing)




#### Energy-Delay Results for Parallelism and Pipelining




## Is it possible to lower the Energy?



- Reduce Energy for same Delay!
- Improve Delay for same Energy!


## Achieved Energy Savings in KS and HC Adders



Simulation of 64-bit static adders confirms saving!




# **Reduction of Hot-Spots**



Energy minimization improves hotspots!


#### Accomplishments (June 30, 2003 to June 30, 2004)

90nm technology



# Summary

- Energy-Efficient Design requires:
  - Early structure comparison in energy-delay space
  - Early Layout/Floorplanning
  - Optimization using energy minimization objective function
- LE does not guarantee a good design
- Our method of energy-minimization focuses on reducing power
- The same principles hold for other logic functions

## Future Work

- Methodology improvement
  - General design rules
  - Algorithms with fast convergence
  - Guidelines for close-to-optimal solutions
- CAD tool
  - Custom vs. standard cell library
- Improve gate modeling & characterization
  - Worst-case vs. single-switching,... or between?
  - Process variation


## Publications on Energy-Delay:

- Vojin G. Oklobdzija, Bart R. Zeydel, Hoang Dao, Sanu Mathew, Ram Krishnamurthy, "*Energy-Delay Estimation Technique for High-Performance Microprocessor VLSI Adders*", Proceedings of the International Symposium on Computer Arithmetic, ARITH-16, Santiago de Compostela, SPAIN, June 15-18, 2003.
- Hoang Q. Dao, Bart R. Zeydel, Vojin G. Oklobdzija, "*Energy Minimization Method for Optimal Energy-Delay Extraction*", Proceedings of the European Solid-State Circuits Conference, ESSCIRC 2003, Estoril, PORTUGAL, September 16-18, 2003.
- V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, R. Krishnamurthy, "*Comparison of High-Performance VLSI Adders in Energy-Delay Space*", IEEE Transactions on VLSI Systems, Volume 13, Issue 6, pp. 754-758, June 2005.
- Hoang Q. Dao, Bart R. Zeydel, Victor Zyuban, and Vojin G. Oklobdzija, "On Energy Optimization of Digital Systems", The fourth annual IBM Austin Conference on Energy-Efficient Design, ACEED 2005, Austin, Texas, March 1-3, 2005.
- H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "Energy-Efficient Optimization of the Viterbi ACS Unit Architecture", *Proceedings of the Asian Solid-State Circuit Conference, A-SSCC 2005, Hsinchu, Taiwan, November 1-3, 2005*.
- H. Q. Dao, B. R. Zeydel, V. G. Oklobdzija, "*Energy Optimization of Pipelined Digital Systems Using Circuit Sizing and Supply Scaling*", IEEE Transactions on VLSI Systems, Vol. 14, Issue 2, pp. 122-134, Feb. 2006.
- S. K. Hsu, S. K. Mathew, M. A. Anders, B. R. Zeydel, V. G. Oklobdzija, R. K. Krishnamurthy, S. Y. Borkar, "A 110 GOPS/W 16-bit Multiplier and Reconfigurable PLA Loop in 90-nm CMOS", IEEE Journal of Solid-State Circuits, Vol.41, No.1, January 2006.
- B. R. Zeydel, D. Baran, V. G. Oklobdzija, *"Energy Efficient Design of High-Performance VLSI Adders"*, IEEE Journal of Solid-State Circuits, June 2010.
