# Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Sandro Bartolini\*

Department of Information Engineering, University of Siena, Italy

bartolini@dii.unisi.it

OPTICS Workshop, Grenoble, 13/3/2015





- Introduction
- Ring-based optical interconnection tradeoffs
- Fast path-setup for switched optical networks
- Software restructuring for matching ONoC features
- Conclusions





### Introduction and motivation – processors (1)

- Nowadays processors are parallel
  - The main reason was the emergence of <u>wire-delay issues</u>, i.e. on-chip latency









Pictured processors (core count): Pentium 4 (1), Core Duo (2), i7-980X (**6**), i7-5960X / AMD FX8370 (8), Tilera Tile64 (64)

Beyond about 10 cores → tiled design

### Introduction and motivation – processors (2)

- "... when processor A wants to talk to processor B ..."
- Shared memory model is here to stay ... some more
  - At least within core clusters
  - Ease of programmability
  - Scalable directory-based coherence
    - Numerous messages exchanged for each load/store
    - E.g. 80% control (8 bytes) and only 20% data (64 bytes)
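The message mix above can be turned into a byte mix with a few lines of arithmetic. The sketch below is only an illustration: it assumes the quoted sizes are the full message sizes (no extra headers), and the function name `byte_shares` is ours, not from the talk.

```python
# Message mix quoted on the slide: 80% control (8 bytes), 20% data (64 bytes).
CTRL_FRACTION, CTRL_BYTES = 0.80, 8
DATA_FRACTION, DATA_BYTES = 0.20, 64


def byte_shares(ctrl_frac, ctrl_bytes, data_frac, data_bytes):
    """Return (control share, data share) of the total bytes moved."""
    ctrl = ctrl_frac * ctrl_bytes
    data = data_frac * data_bytes
    total = ctrl + data
    return ctrl / total, data / total


ctrl_share, data_share = byte_shares(
    CTRL_FRACTION, CTRL_BYTES, DATA_FRACTION, DATA_BYTES
)
# Average message size under the assumption above (no headers).
avg_message_bytes = CTRL_FRACTION * CTRL_BYTES + DATA_FRACTION * DATA_BYTES

print(f"control: {ctrl_share:.1%} of bytes, data: {data_share:.1%} of bytes")
print(f"average message size: {avg_message_bytes:.1f} bytes")
```

Under these assumptions control messages dominate the message count but carry only about one third of the bytes, and the average message is short, which is why per-message latency matters more than raw bandwidth.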




#### Introduction and motivation – coherence traffic



Figure: total, control-only (Ctrl) and data-only (Data) traffic statistics for an ideal network

Low average load, bursty → message latency is critical

#### Introduction and motivation – Integrated photonics (1)

Optical communication technology can now be integrated in CMOS process



#### Pros:

- Fast propagation (16 ps/mm)
  - About 10x lower latency than electronic wires (e.g. 22nm [1])
  - E.g. a couple of cycles @ 3 GHz to cross a 2 cm chip corner to corner
- Compatible with CMOS fabrication
- High-bandwidth: 10-40 GHz frequency and WDM (order of 1 Tbps)
- End-to-end: energy consumption almost insensitive to distance
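The "couple of cycles" figure can be checked from the quoted propagation delay. This is a back-of-the-envelope sketch: the square die and the two routing options (along two chip edges, or along the diagonal) are our assumptions, and only time of flight is counted, not modulation or detection.

```python
import math

PROPAGATION_PS_PER_MM = 16.0  # from the slide
CLOCK_GHZ = 3.0               # from the slide
CHIP_SIDE_MM = 20.0           # 2 cm chip; a square die is an assumption


def cycles_for_distance(distance_mm,
                        ps_per_mm=PROPAGATION_PS_PER_MM,
                        clock_ghz=CLOCK_GHZ):
    """Time of flight over distance_mm, expressed in clock cycles."""
    flight_ps = distance_mm * ps_per_mm
    cycle_ps = 1000.0 / clock_ghz
    return flight_ps / cycle_ps


# Assumption: the waveguide follows two chip edges (Manhattan route).
manhattan_mm = 2 * CHIP_SIDE_MM
# Assumption: the waveguide runs straight along the diagonal.
diagonal_mm = math.hypot(CHIP_SIDE_MM, CHIP_SIDE_MM)

print(f"edge route: {cycles_for_distance(manhattan_mm):.2f} cycles")
print(f"diagonal:   {cycles_for_distance(diagonal_mm):.2f} cycles")
```

Both routes stay below two cycles at 3 GHz, consistent with the claim on the slide.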

#### Cons:

- End-to-end, i.e. no store-and-forward → a lot of existing know-how has to be thrown away
- Active components (lasers, photodetectors) have some integration problems
  - Can have or induce significant static power consumption

#### Introduction and motivation – Integrated photonics (2)

- Integrated photonics is still in its infancy in serving computer systems' <u>local</u> requirements
  - Traffic close to cores is very different from the aggregated traffic at wider scale (e.g. blade/rack/datacenter)
- Layered design is not yet consolidated enough
  - Application-, architecture-, network-level requirements and choices are not orthogonal to optical design choices
    - Like: topology, access schemes, resource provisioning, DWDM, technological choices
    - Interactions can induce very different performance/consumption in the optical network
- Need for an integrated multi-layer approach
  - Effective designs
  - Exploration and consolidation of best practices








#### Ring-based ONoC tradeoffs

- Hybrid Electronic-Optical on-chip networks or <u>all-optical</u>
  - Optical network based on ring logical topology
  - Simplicity of ring topology can be a good reference design point
- We analyze the relationships between optical resource provisioning (waveguides) and core number, versus:
  - Traffic share offloaded to the optical network
    - (One ring) Only read-requests, invalidations and invalidation acks
    - (Multiple rings) All traffic
  - MWSR (Multiple-Writer Single-Reader) and MWMR (Multiple-Writer Multiple-Reader) access schemes

Considering performance and power metrics



#### Simulation Environment and Methodology

- Gem5 simulator in Full-System mode (Linux 2.6.27 booted in gem5), 8/16/32/64 cores running the Parsec 2.1 multi-threaded benchmark suite
  - Multi-threaded applications → we forced <u>core affinity</u> during application execution to avoid non-determinism due to OS scheduling

#### **Architectural parameters**

| Parameter    | Value                                                                                                                          |
|--------------|--------------------------------------------------------------------------------------------------------------------------------|
| Cores        | 8/16/32 cores (64 bit), 4 GHz                                                                                                  |
| L1 caches    | 16 kB (I) + 16 kB (D), 2-way, 1 cycle hit time                                                                                 |
| L2 cache     | 16 MB, 8-way, shared and distributed 8/16/32 banks, 3/12 cycles tag/tag+data                                                   |
| Directory    | MOESI protocol, 8/16/32 slices, 3 cycles                                                                                       |
| ENoC         | 2D-Mesh/Torus, 4 GHz, 4/5 cycles/hop, 32 nm, 1 V, 64/128 bit/flit                                                              |
| Optical Ring | 3D-stacked, 1-9 parallel waveguides, 30 mm length, 8/16/32 I/O ports, 10 GHz, 64/70 (16 and 32) wavelengths, 460 ps full round |
| Main memory  | 4 GB, 300 cycles                                                                                                               |
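To get a feeling for the optical ring parameters in the table, the following sketch derives raw bandwidth and serialization times. It rests on assumptions not stated in the talk: one bit per wavelength per 10 GHz modulation cycle, and a sender that can use all 64 wavelengths of a waveguide at once (the actual share depends on the MWSR/MWMR access scheme). The helper names are ours.

```python
import math

MODULATION_GHZ = 10        # from the table
WAVELENGTHS = 64           # from the table (64-wavelength case)
MAX_WAVEGUIDES = 9         # from the table (1-9 parallel waveguides)
ROUND_TRIP_PS = 460        # from the table (full round of the ring)
CORE_CLOCK_GHZ = 4         # from the table


def waveguide_gbps(wavelengths, ghz):
    """Raw bandwidth of one waveguide, assuming 1 bit/cycle/wavelength."""
    return wavelengths * ghz


def serialization_ps(message_bytes, wavelengths, ghz):
    """Time to push a message onto one waveguide using all wavelengths."""
    bit_times = math.ceil(message_bytes * 8 / wavelengths)
    return bit_times * 1000.0 / ghz


per_waveguide = waveguide_gbps(WAVELENGTHS, MODULATION_GHZ)
aggregate = per_waveguide * MAX_WAVEGUIDES
round_trip_cycles = ROUND_TRIP_PS * CORE_CLOCK_GHZ / 1000.0

print(f"one waveguide: {per_waveguide} Gbps, nine: {aggregate} Gbps")
print(f"ring round trip: {round_trip_cycles:.2f} core cycles")
print(f"8-byte control message:  {serialization_ps(8, WAVELENGTHS, MODULATION_GHZ):.0f} ps")
print(f"64-byte data message:    {serialization_ps(64, WAVELENGTHS, MODULATION_GHZ):.0f} ps")
```

Under these assumptions even a full round of the ring costs less than two core cycles, so serialization and arbitration, rather than propagation, dominate the optical message latency.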

#### Results: multiple rings and all traffic

- No electrical NoC: searching for good design points
  - From low bandwidth up to 8/9 64-wavelength rings



[Grani, Bartolini, "Design Options for Optical Ring Interconnect in Future Client Devices", ACM JETC, 2014]

### Results: energy

- Sensitivity to the number of photonics rings for MWMR and MWSR
  - More rings → higher overall optical NoC power: more MRRs (microring resonators), higher insertion loss (IL: crossings, splitting, ...), hence higher laser power
  - But some topology assumptions can make the difference ...



[Grani, Bartolini, "Design Options for Optical Ring Interconnect in Future Client Devices", ACM JETC, 2014]






#### Fast path-setup for switched optical networks

- Switched optical networks can provide
  - Potentially higher scalability than ring (crossbar)-based approaches
  - Require broadband optical switches
  - But suffer from sequential path-setup time
    - <u>High overhead</u> for high <u>endpoint number</u> and for "small" message sizes
      - "small" can mean anything below a few thousand bytes!
      - Coherence traffic is out of the game
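The impact of path-setup time on short messages can be sketched with a simple model: the fraction of the total transaction time spent on setup. All numbers in the example (50 setup cycles, 16 bytes per cycle once the path is up) are purely illustrative assumptions, not measurements from the talk.

```python
def setup_overhead(message_bytes, setup_cycles, bytes_per_cycle):
    """Fraction of total time spent on path setup rather than on data."""
    transfer_cycles = message_bytes / bytes_per_cycle
    return setup_cycles / (setup_cycles + transfer_cycles)


# Hypothetical parameters, chosen only to illustrate the trend.
SETUP_CYCLES = 50
BYTES_PER_CYCLE = 16

for size in (8, 64, 1024, 16384):
    share = setup_overhead(size, SETUP_CYCLES, BYTES_PER_CYCLE)
    print(f"{size:>6} bytes: {share:.1%} of the time is path setup")
```

With these illustrative values, setup dominates for 8-byte and 64-byte coherence messages and only becomes a minor cost for transfers of many kilobytes.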



#### Fast path-setup for switched optical networks (2)

- We propose a centralized arbiter that can <u>simultaneously configure the required optical switches</u> through a wavelength-routed optical ring
  - Network from cores to arbiter is optical ring-based (wavelength-routed)
  - Decoupling topologies of path-setup network and data network





#### Results: single arbiter

| Setup       | Serial PS<br>[cycles] | Simultaneous PS<br>[cycles] |
|-------------|-----------------------|-----------------------------|
| 8-core-AVG  | 51.26                 | 25.94                       |
| 16-core-AVG | 70.37                 | 27.45                       |
| 32-core-AVG | 156.18                | 65.60                       |

Path-setup latency: dramatic average reduction, even though conceptually the arbiter serializes path-setup requests

- The arbiter works well for 8- and 16-core setups
- For the 32-core case the arbiter induces a 25% slowdown due to path-setup serialization
- In all cases the arbiter performs much better than the serial path setup
  - Serial PS prevents cache-coherence traffic from working
  - The arbiter supports this traffic
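The size of the reduction follows directly from the average path-setup latencies in the table. The short script below only restates those table values and derives the ratios; nothing else is assumed.

```python
# Average path-setup latency in cycles: (serial PS, simultaneous PS),
# copied from the results table.
path_setup_cycles = {
    "8-core": (51.26, 25.94),
    "16-core": (70.37, 27.45),
    "32-core": (156.18, 65.60),
}

reductions = {}
speedups = {}
for setup, (serial, simultaneous) in path_setup_cycles.items():
    speedups[setup] = serial / simultaneous
    reductions[setup] = 1 - simultaneous / serial
    print(f"{setup}: {speedups[setup]:.2f}x faster setup, "
          f"{reductions[setup]:.1%} fewer cycles")
```

In all three configurations the simultaneous setup takes roughly half the cycles of the serial one or less.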



### Scalable fast-setup: multiple arbiters





**Multi-Arbiter 4-Clusters** 

- One arbiter per cluster, optical ring between clusters
- Independent path-setups within clusters
- <u>Coordinated inter-cluster path-setup</u>
- <u>2-level MOESI to increase in-cluster paths</u>








#### Software restructuring for matching ONoC features

- On-chip optical networks can potentially mitigate wire-delay issues in tiled CMPs
  - Distant cores (i.e. many "hops" away) can actually be reached in a few cycles
- Access time to distributed cache resources is <u>more uniform</u>
  - Compared with Non-Uniform Cache Access (NUCA) architectures (e.g. in tiled CMPs)
- Software restructuring techniques for locality typically aim at putting data close to where they are used (cores), so as to reduce access time in NUCA caches
  - Delicate balance between placing data at the ideal access barycenter of the source cores and the conflict (miss) penalties in the zones that become congested in order to keep NoC access overhead low



#### Software restructuring for matching ONoC features (2)

- With ONoCs, on average, we can afford to use far more cache
  - Spread data much more across the chip to reduce conflicts (misses), as the <u>distance</u> overhead is almost constant
  - Hypothesis: using our arbitrated optical switched network
- Not exactly straightforward though:
  - With switched networks, conflicts arise not only from data placement but also from <u>message paths</u> and, specifically, <u>sub-paths</u>!
  - Need to
    - Spread data to gain from reduced conflict misses
    - But (!) not too far from the barycenter, otherwise the average path length increases and so does the path-setup conflict probability (big overhead!)
  - Each VM page is first positioned at the access barycenter
  - Then positions within a "radius" of H hops are tried, looking for the minimum of an access-cost function
    - Considering: misses, path length and conflicts, cache, memory and ONoC access times
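The placement procedure above (barycenter first, then a search within H hops) can be sketched as follows. The cost function here is a deliberately simplified stand-in of our own: it combines only the access-weighted path length and the number of pages already mapped to a tile, with arbitrary weights, whereas the cost function in the talk also accounts for path conflicts and for cache, memory and ONoC access times.

```python
Tile = tuple[int, int]


def hops(a: Tile, b: Tile) -> int:
    """Manhattan distance between two tiles (mesh layout is an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def barycenter(accesses: dict[Tile, int]) -> Tile:
    """Access-weighted centre of the cores touching a page, snapped to a tile."""
    total = sum(accesses.values())
    x = sum(t[0] * n for t, n in accesses.items()) / total
    y = sum(t[1] * n for t, n in accesses.items()) / total
    return (round(x), round(y))


def cost(tile, accesses, pages_in_tile, path_weight=1.0, conflict_weight=2.0):
    """Simplified stand-in cost: average path length plus tile crowding."""
    total = sum(accesses.values())
    avg_path = sum(hops(tile, core) * n for core, n in accesses.items()) / total
    return path_weight * avg_path + conflict_weight * pages_in_tile.get(tile, 0)


def place_page(accesses, pages_in_tile, radius, width, height) -> Tile:
    """Start at the barycenter, then pick the cheapest tile within `radius` hops."""
    center = barycenter(accesses)
    candidates = [
        (x, y)
        for x in range(width)
        for y in range(height)
        if hops((x, y), center) <= radius
    ]
    return min(candidates, key=lambda t: cost(t, accesses, pages_in_tile))


# Hypothetical example: two cores share a page, the barycenter tile is crowded.
accesses = {(0, 0): 10, (2, 0): 10}
pages_in_tile = {(1, 0): 3}
no_search = place_page(accesses, pages_in_tile, radius=0, width=4, height=4)
with_search = place_page(accesses, pages_in_tile, radius=1, width=4, height=4)
print(no_search, with_search)
```

With radius 0 the page stays on the crowded barycenter tile; allowing one hop of slack moves it to a neighbouring tile with the same average path length and no crowding penalty.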



[Grani, Bartolini, Frediani, Ramini, Bertozzi, "Integrated Cross-Layer Solutions for Enabling Silicon Photonics into Future Chip Multiprocessors", IEEE IMS3TW, 2014]

#### Results: Software restructuring for switched ONoC



- Software restructuring gains 19% more speedup over the standard arbiter solution (7% over the Baseline)
  - The electronic baseline cannot benefit from this software restructuring because access times to tiles are heavily non-uniform






#### Conclusions

- Integrated optics is a breakthrough technology that brings a number of benefits
  - Improvements in devices will make its prospects even better
- Its technological discontinuity needs to be integrated with patience and with a deep vertical design approach in modern processor and computer design
  - To master and exploit the various <u>two-way inter-related effects</u> that are nowadays present between layers
    - Otherwise there is a significant risk of sub-optimal designs, possibly even worse than some well-tuned electronic solutions
  - For reaching effective designs
    - Opportunities and constraints of different layers must ... meet in the middle :)


### Thanks for your attention!

**Q&A** 



