


# Ultra-Low-Power Software-Defined Radio for LTE Wireless Baseband: An Embedded Systems Grand Challenge

Chris Rowen, Founder and CTO, Tensilica, 18 March 2010

# tensilica<sup>.</sup>

### **Two Curves**





**Mobile Broadband Terminal Units** 



# **Two Curves**

### Why Are These Curves Exciting?

- Incredible density means true single-chip integration
- High volume across a range of semiconductor designs
- New opportunities for processors

### Why Are These Curves Frightening?

- Moore's Law enables high density, but how do we cope with complexity and rate of change?
- Many required functions, but few viable single-function chips
  - → Silicon platforms must integrate or die
- So many transistors, so little battery capacity
  - → Steeper improvement in density than in energy efficiency (new transistor types?)



# A Grand Challenge Problem: LTE Handsets

### **Design Goals**

- High data rates: 150 Mbps DL / 50 Mbps UL
- High spectral efficiency with
  - Orthogonal Frequency-Division Multiplexing
  - Multiple-Input Multiple-Output
- Scalable bandwidth: 1.25 MHz to 20 MHz
- Both Frequency-Division and Time-Division Duplex
- All-IP networks

### **Implementation Needs**

- Low silicon cost: < 20 mm<sup>2</sup> for digital PHY
- Low power: < 2 mW/Mbps DL
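
As a rough worked example of what the power target implies, the sketch below multiplies the 2 mW/Mbps ceiling by the 150 Mbps peak downlink rate from the design goals. The function name is hypothetical, and applying the target at peak rate is an assumption made for illustration.

```python
def phy_power_budget_mw(mw_per_mbps: float, rate_mbps: float) -> float:
    """Upper bound on digital PHY power for a given power-per-throughput target."""
    return mw_per_mbps * rate_mbps


# Figures taken from the slide: < 2 mW/Mbps, 150 Mbps downlink.
TARGET_MW_PER_MBPS = 2.0
PEAK_DL_MBPS = 150.0

budget_mw = phy_power_budget_mw(TARGET_MW_PER_MBPS, PEAK_DL_MBPS)

# 1 mW/Mbps = 1e-3 W / 1e6 bit/s = 1 nJ per bit, so the target is also
# an energy-per-bit ceiling expressed in nanojoules.
energy_per_bit_nj = TARGET_MW_PER_MBPS

print(f"Peak DL power budget: under {budget_mw:.0f} mW")
print(f"Energy per received bit: under {energy_per_bit_nj:.0f} nJ")
```

Read this way, the whole digital PHY has to run a 150 Mbps downlink in a few hundred milliwatts, which is what makes the programmability requirement below a paradox.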

# Paradox: Reach performance and power goals while building a programmable system





# The World's Quickest Introduction to LTE





# **Building Solutions to Solve Problems**

The raw material: more efficient processors

A range of building blocks: baseband processors

A solution architecture: LTE Reference Architecture





# The Essential Building Block





### **Xtensa LX3 Extensible Dataplane Processor**



# The Essential Automation: Xtensa Processor Generator





#### Copyright © 2009, Tensilica, Inc.


# Multiple Core Communication = Interconnect + Software





### **Software for Multi-core:**

- Modeling at every level:
  - 1. Gate/RTL level
  - 2. FPGA netlist
  - 3. Pin exact
  - 4. Cycle accurate
  - 5. Instruction accurate
- Multi-core debug and analysis
- Communications APIs
- Complete DSP Libraries
- Solution stacks for LTE
  - L1 PHY
  - L2/L3 MAC, RLC

### Scalable platform for all design approaches





# **Core 1: ConnX BBE16 Baseband Engine**

**Ultra-High Performance and Programming Ease** 

### **Architecture:**

- 16 simultaneous 18b x 18b MACs per cycle
- 8-way SIMD + 3-way VLIW
- Dual load/store unit (128b wide)
- Scalar CPU pipeline with general 32b RISC instructions
- Extended precision with guard bits
- 6 addressing modes

### **Performance:**

- 16 multiply-adds per cycle
- Three 8-way ops per cycle
- 4 complex FIR taps / cycle
- 1 Radix-4 FFT butterfly / cycle
- 17 GB/s memory bandwidth (@ 550 MHz)
- 40-bit accumulation on all MAC operations without performance penalty
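
The performance figures above can be cross-checked against the architecture bullets with some simple arithmetic. The sketch below is an illustration only: the constant names are hypothetical, and it assumes a complex multiply is built from four real multiplies and that both load/store units move 128 bits every cycle.

```python
# Figures from the slide.
MACS_PER_CYCLE = 16
LOAD_STORE_UNITS = 2           # dual load/store unit
LOAD_STORE_WIDTH_BITS = 128
CLOCK_HZ = 550e6

# Assumption: (a + jb)(c + jd) = (ac - bd) + j(ad + bc) uses 4 real multiplies.
REAL_MACS_PER_COMPLEX_MAC = 4
complex_taps_per_cycle = MACS_PER_CYCLE // REAL_MACS_PER_COMPLEX_MAC

# Assumption: both units transfer a full 128-bit word on every cycle.
bytes_per_cycle = LOAD_STORE_UNITS * LOAD_STORE_WIDTH_BITS // 8
bandwidth_gb_s = bytes_per_cycle * CLOCK_HZ / 1e9

print(f"Complex FIR taps per cycle: {complex_taps_per_cycle}")
print(f"Peak memory bandwidth: {bandwidth_gb_s:.1f} GB/s")
```

Under these assumptions the 16 real MACs account for the 4 complex FIR taps per cycle, and the peak bandwidth comes out slightly above the quoted 17 GB/s.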





# Core 2: ConnX SSP16 Processor

### Processing soft bit streams: 3x more efficient than a standard DSP

#### **Data-types:**

- 10-bit and 8-bit vectors
- 8-bit, 16-bit and 32-bit scalars

#### **Performance:**

- 16 arithmetic operations per cycle
- 2-issue VLIW architecture
- ~10GB/s memory bandwidth

#### Software:

- Vectorizing compiler
- Multi-core debugger
- Fast, cycle-accurate simulator
- Energy models
- Function libraries and 3<sup>rd</sup> party cellular stacks




# Core 3: ConnX BSP3 Processor

#### "Bit Stream Processing at minimal size and power"

#### **Data-types:**

- 8-bit, 16-bit and 32-bit scalars

#### **Performance:**

- Dual Load/Store
- 3-issue VLIW architecture
- 600MHz in 45nm
- >4GB/s memory bandwidth

#### Software:

- C/C++ compiler
- Multi-core debugger
- Fast, cycle-accurate simulator
- Energy models
- Function libraries and 3<sup>rd</sup> party cellular stacks







# Core 4: ConnX Turbo16

### Programmable Turbo Decoding Engine

- Customized processor-based solution for LTE Turbo decoding at 150 Mbps
  - Matches throughput of a hardwired RTL solution with similar area and power
- Max-Log-MAP based decoding
  - Uses: log(exp(a) + exp(b)) ≈ max(a, b)
- Decoding based on 8 parallel windows
  - Two bits from each window processed per update
  - Interleaving and deinterleaving operations integrated with load/store operations to the local memories (Odd and Even D-RAM)
- Confidence estimator for software-based early-termination → lower power
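
The Max-Log-MAP simplification replaces the exact log-domain sum, log(exp(a) + exp(b)), with max(a, b). The sketch below compares the two to show the size of the term being dropped. It is a generic illustration of the approximation, not the Turbo16 implementation, and the function names are hypothetical.

```python
import math


def max_star(a: float, b: float) -> float:
    """Exact Jacobian logarithm: log(exp(a) + exp(b)), computed stably."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """Max-Log-MAP approximation: keep the max, drop the correction term."""
    return max(a, b)


# The dropped correction term is log(1 + exp(-|a - b|)). It is largest,
# log(2), when a == b and shrinks quickly as the metrics move apart.
for a, b in [(0.0, 0.0), (2.0, -1.0), (5.0, -5.0)]:
    error = max_star(a, b) - max_log(a, b)
    print(f"a={a:+.1f} b={b:+.1f} exact={max_star(a, b):.4f} "
          f"approx={max_log(a, b):.4f} error={error:.4f}")
```

Dropping the correction term trades a small loss in decoding accuracy for an update that needs only add, compare, and select operations, which is a natural fit for a SIMD datapath.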




### LTE Reference Architecture "Atlas"



- Demonstration of best practices for low-cost/low-power multi-core-based digital PHY sub-system
  - 100% processor-based: no RTL blocks
  - Highlights agile configuration of ConnX processors
- Proves efficiency of digital-PHY solution with processors
  - Reference architecture implements all blocks, memories and interconnect
  - Port of a complete PHY software solution: mimoOn mi!MobilePHY<sup>™</sup>, a fully 3GPP-compliant LTE PHY
- 7-core solution for Cat 4 UE (150 Mbps DL / 50 Mbps UL, 20 MHz):
  - 3 x BBE16
  - 2 x SSP16
  - 1 x BSP3
  - 1 x Turbo16
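
The core mix above can be written down as a small table for quick sanity checks. This is a sketch only: the dictionary and variable names are hypothetical, and the aggregate MAC figure counts just the BBE16 cores, since the slides quote a MAC rate only for that core.

```python
# Core counts from the Atlas slide.
ATLAS_CORES = {
    "BBE16": 3,
    "SSP16": 2,
    "BSP3": 1,
    "Turbo16": 1,
}

# From the BBE16 slide: 16 multiply-adds per cycle per core.
BBE16_MACS_PER_CYCLE = 16

total_cores = sum(ATLAS_CORES.values())
aggregate_bbe16_macs = ATLAS_CORES["BBE16"] * BBE16_MACS_PER_CYCLE

print(f"Total cores: {total_cores}")
print(f"Aggregate BBE16 MACs per cycle: {aggregate_bbe16_macs}")
```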



## **ATLAS-LTE UE**



## Wrap-up

- 1. Data-plane processors combine three characteristics
  - Adaptable: interface, instruction set, and memory system automatically fit the need
  - **Programmable**: proven instruction sets, compilers, libraries, SW stacks
  - Efficient: rivals hardware area and power dissipation on complex problems
- 2. LTE handset challenge shows the need, and the possibility, for better solutions
- 3. Even bigger challenges ahead

Figure: relative SOC complexity (M1 features/mm<sup>2</sup>); SOC complexity in 2010 = 0.0


