My research in computer architecture focuses on interconnection networks for scalable multi-core and distributed processor microarchitectures. Computer architecture bridges the gap between application and device technology. Application workloads have never been more compute intensive; meanwhile, device constraints of power, heat, and reliability have recently forced the computer industry to shift focus from single processor core performance to instantiating multiple processor cores on each chip. As computer architecture enters the era of multi-core, inter-core communication becomes critical to continued improvement in system performance. As a graduate student I led the design, implementation, and verification of the second level cache and inter-processor network-on-chip (NOC) on the TRIPS processor prototype [2]. Building upon my work on the TRIPS networks-on-chip, my dissertation research explores network utilization efficiency of realistic NOC designs as a means to improve performance and manage power. Looking forward, I plan to focus my future research on closing the expanding gap between application and technology as we move towards chip-multiprocessors (CMPs) with hundreds to thousands of cores. In this statement, I outline the contributions of my dissertation and discuss the direction of my future research.
Dissertation Summary
Off-chip interconnection networks provide for communication between processors and components within computer systems. Semiconductor process technology trends have led to the inclusion of multiple processors and components onto a single chip and there is much recent research in on-chip interconnection networks to connect them together. On-chip networks provide scalable, high-bandwidth interconnect enabling distributed microarchitectures and chip-multiprocessor designs. On-chip networks present several new challenges, different from off-chip networks, including tighter constraints in power, area, and end-to-end latency as well as increasing reliability concerns.
My dissertation focuses on interconnection network architectures that address the unique network-on-chip design challenges of power, reliability, and balanced network utilization. My work in the design, implementation, and evaluation of the on-chip networks of the TRIPS project’s prototype processor, a real hardware implementation, is the foundation for my on-going work in on-chip networking. Building on my analysis of the TRIPS on-chip networks I explored novel network architectures that use live, full-network status information to improve network utilization efficiency.
The TRIPS Processor:
TRIPS, an experimental computer system we designed and prototyped, is a distributed processor microarchitecture in which traditional processor components are divided into a collection of self-contained tiles. The tiles of the TRIPS system communicate with one another via control and data networks; these networks make a distributed processor design possible [5]. I designed and implemented the TRIPS second level (L2) cache system, the banks of which are interconnected by the TRIPS On-Chip Network (OCN) and its off-chip extension the TRIPS Chip-to-Chip (C2C) network. The TRIPS L2 cache supports polymorphic configuration as a shared, static NUCA L2 cache, private L2 NUCA caches for each processor, or directly addressed scratch pad memory. This polymorphism is facilitated through architectural support built into the OCN network-on-chip. The OCN also interconnects the two processor cores with various I/O units and off-chip to other TRIPS chips via the C2C. A second network on-chip, the Operand Network (OPN), interconnects the execution units and serves as an operand bypass network, integrated tightly with the processor core.
My evaluation of the TRIPS NOCs shows that these networks provide processor performance within 28% of non-contended, ideal interconnect [4, 2, 3]. This evaluation serves as a case study in the effect on-chip design constraints have on the design of networks on-chip, and the influence the inevitable design trade-offs have on full system performance. One particular insight gained from the TRIPS OCN and OPN networks was how network resource imbalances, which lead to contention and poor performance, are transient with time and task. Timely information about the status of the network can be used to balance the resource utilization and improve system performance.
Interconnection Network Research:
The evaluation of the TRIPS networks on-chip illustrated that minimizing end-to-end packet latency was critical to maintaining system performance. A challenge lies in providing the right information, conveyed in a timely fashion. Effective use of contention information, without affecting end-to-end latency, is challenging in networking on-chip. In my dissertation, I explore several novel adaptive routing techniques that address the challenge of managing the end-to-end latency. One of the techniques, called regional congestion awareness (RCA), is a lightweight mechanism for integrating congestion information from different points in the network into the port selection process of a typical network on-chip adaptive router. RCA uses a low-bandwidth monitoring network to propagate congestion information among adjacent routers. At each hop along the way, local congestion status is aggregated with information from neighboring nodes; the aggregated information is used for port selection and then propagated to upstream routers. RCA outperforms standard adaptive routing across a variety of synthetic and real application workloads, with a 16% average and 71% maximum latency reduction on SPLASH-2 benchmarks running on a 49-core CMP [1]. Another technique I explore in my dissertation, called source calculated congestion aware networks (SCCAN), uses congestion information from across the network to inform route selection. The route is then encoded into the packet header at injection time. This approach reduces router complexity and yields a deeper view of network contention. In addition to reducing average packet latency, collectively these techniques provide benefits in terms of power, reliability and network complexity reduction.
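The aggregate-then-propagate idea behind RCA can be illustrated with a small sketch. It rests on simplifying assumptions chosen for illustration and is not the RCA router design itself: a single row of routers, congestion expressed as a number between 0 and 1, an equal weighting of local and received values, and hypothetical function names (`aggregate`, `propagate_along_row`, `select_port`).

```python
def aggregate(local, received, local_weight=0.5):
    """Blend a router's own congestion estimate with the value it received.

    The equal weighting is an assumption made for this sketch.
    """
    return local_weight * local + (1.0 - local_weight) * received


def propagate_along_row(local_congestion):
    """Aggregate congestion from the far (east) end of a row back upstream.

    local_congestion[i] is the local estimate at router i. The result at
    index i summarizes router i and everything east of it, with distant
    routers contributing progressively less.
    """
    aggregated = [0.0] * len(local_congestion)
    received = 0.0  # nothing lies beyond the east edge of the row
    for i in range(len(local_congestion) - 1, -1, -1):
        aggregated[i] = aggregate(local_congestion[i], received)
        received = aggregated[i]  # value handed to the upstream neighbor
    return aggregated


def select_port(candidates):
    """Pick the candidate output port with the lowest aggregated congestion."""
    return min(candidates, key=candidates.get)


row = propagate_along_row([0.0, 0.0, 1.0])
port = select_port({"east": row[0], "north": 0.5})
```

In this example the congested router two hops away still influences the estimate at the first router, although with reduced weight, so the first router can steer traffic before it reaches the hot spot.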
Implementation Experience:
In addition to my network-on-chip research, I played a lead role in the implementation of the TRIPS processor prototype. I designed and developed the RTL database and the Verilog simulation and synthesis infrastructure for the TRIPS processor prototype, a 170M-transistor chip in a 130nm ASIC process technology. Actual implementation, whether in silicon or on an FPGA substrate, provides an opportunity both to validate research and to gain unique insights into design trade-offs. These insights are unattainable in simulator-based architecture studies. For example, the experience of achieving timing closure on the units of the second level cache highlighted the trade-offs between area, timing, and power in a concrete fashion. These insights inform my work as a researcher in computer architecture.
Interdisciplinary Research:
As a master’s student in the Department of Electrical and Computer Engineering at the University of Florida, my advisor, Dr. Fred Taylor, and I collaborated with Dr. Thomas D. Carr of the Department of Astronomy to examine mono-tonal chirps in a radio astronomy signal from the rings of Jupiter using digital signal processing techniques. Similarly, my computer architecture research at the University of Texas at Austin sits on the boundary between traditional electrical engineering and computer science. In discussions within my research group, composed of students from both the Department of Electrical and Computer Engineering and the Department of Computer Science, I found novel solutions to NOC research problems, such as compiler-based solutions to NOC routing problems. In an effort to extend my research collaboration to groups outside of my university, I recently provided TRIPS NOC traffic traces and interpretation assistance to research groups at Princeton University, the University of California at Irvine, and the Technical University of Valencia (Spain). Interaction with researchers of broadly diverse backgrounds provides both the impetus for research, in terms of new problems to solve, and insights into possible solutions for those problems. In the future I plan to forge partnerships with researchers inside and outside my department, as well as outside my institution.
Future Work
Rarely has there been as exciting a time to do research in computer architecture. From grand challenge problems such as long-term climate modeling and the simulation of the molecular dynamics of cell membranes, to implementing real-time human interfaces on low power, handheld devices, application compute requirements continue to increase. Meanwhile, the physical design constraints of power, heat, and reliability have proven traditional superscalar microarchitectures unscalable. The demand for novel architectures to close the gap between application and technology has never been greater. As a computer architecture researcher, my focus will be on closing this gap. To this end I plan to pursue the following: (1) future multi-core chip interconnect and programmability; and (2) architecture for post-CMOS process technologies.
Future Interconnect:
The major chip manufacturers have collectively given up improving single processor performance in favor of stamping out many processor cores on each chip. This approach provides easy scaling with process technology and economizes design effort. Assuming continued Moore’s law transistor count scaling, we can expect the development of massively-multicore CMPs (MM-CMPs), CMPs with hundreds to thousands of cores, within ten years. Interconnect in MM-CMPs will be crucial to system performance and presents several near-term challenges: (1) on-chip interconnect latencies will be high, impacting system performance; (2) off-chip memory system interconnect bandwidth will be a bottleneck to system performance; and (3) programming applications that utilize an MM-CMP efficiently will be extremely difficult.
Latency Management: The planar constraints of low-cost high volume CMOS technology imply that high-order interconnection network topologies will be impractical for MM-CMPs. As a result, MM-CMP network topologies must have large diameters to connect all nodes. High network diameter leads to high worst-case packet latencies between distant cores, although latencies between neighboring cores will remain low. I plan to investigate network topologies tailored for NOCs, informed task placement, and task migration as a means to reduce communication latencies. I will also investigate distributed quality of service to reduce critical path latencies for tasks that cannot be moved closer together. Emerging technology, such as 3D stacking and optical interconnect, may alter some of the constraints placed on NOCs; however, latency and bandwidth management will always be an important focus of interconnect research.
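A short calculation illustrates how diameter grows with core count. The sketch assumes a square two-dimensional mesh as a representative planar topology and a binary hypercube as a representative high-order topology; both choices, and the function names, are assumptions made for illustration.

```python
import math


def mesh_diameter(num_cores):
    """Worst-case hop count in a square 2D mesh: corner to opposite corner."""
    side = math.isqrt(num_cores)
    if side * side != num_cores:
        raise ValueError("num_cores must be a perfect square")
    return 2 * (side - 1)


def hypercube_diameter(num_cores):
    """Worst-case hop count in a binary hypercube: one hop per dimension."""
    dimensions = num_cores.bit_length() - 1
    if num_cores < 1 or (1 << dimensions) != num_cores:
        raise ValueError("num_cores must be a power of two")
    return dimensions


for cores in (64, 1024):
    print(cores, mesh_diameter(cores), hypercube_diameter(cores))
```

Under these assumptions, a 1024-core mesh has a diameter of 62 hops against 10 for a hypercube of the same size, which is the gap that topology design, task placement, and task migration would need to offset.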
Off-chip Memory System Interconnect: Over the past ten years, many researchers have addressed increasing latencies to access off-chip memory. Much of this work leverages locality to overcome the high latencies of memory system access through caching, prefetching and related techniques. MM-CMPs will use parallel independent threads to achieve performance, each thread requiring independent data. These independent data streams defeat the locality that caching exploits. In MM-CMPs, memory system bandwidth will be a greater design constraint than latency. I plan to investigate techniques to overcome memory system interconnect bandwidth limitations. Adding more on-chip memory controllers should help; however, package pin count constraints will limit the long-term scalability of this solution. Emerging technologies, including high-speed serial links, through-die interconnect and off-chip optical interconnect, show promise in providing scalable off-chip interconnect. In collaboration with VLSI and optical interconnect researchers in the broader community, I will identify possible off-chip interconnect technologies and design memory system architectures to exploit them.
MM-CMP Programmability: Multi-threaded programming targeted at MM-CMPs promises to be a grand challenge. The difficulty in writing and debugging multi-threaded code, even for experienced programmers, is a well-known problem. Researchers have developed many techniques aimed at providing a level of abstraction to make programming for CMPs easier. Unfortunately, easing the programming effort will not necessarily translate into scalable performance gains in multi-threaded or streaming programs. Naive thread or stream placement will create communication patterns which cause high contention and poor spatial locality of data and task, leading to high latencies. I will develop network-on-chip profiling tools to help programmers deal with the challenge of programming for an MM-CMP. For example, visual aids illustrating the traffic patterns and associated contention that naively written programs create would provide vital feedback to the programmer, enabling performance optimizations.
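As a rough sketch of the kind of feedback such a profiling tool might compute, the following tallies traffic per link for a given task placement. It assumes a 2D mesh with dimension-order (X then Y) routing and a hypothetical flow format of (source, destination, packets); none of these details describe an existing tool.

```python
from collections import Counter


def xy_route(src, dst):
    """Return the directed links on the X-then-Y path from src to dst."""
    x, y = src
    links = []
    while x != dst[0]:
        next_x = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != dst[1]:
        next_y = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def link_loads(flows):
    """Sum the packets crossing each directed link over all flows."""
    loads = Counter()
    for src, dst, packets in flows:
        for link in xy_route(src, dst):
            loads[link] += packets
    return loads


# Two communicating task pairs whose routes share one link.
flows = [((0, 0), (2, 0), 10), ((1, 0), (2, 1), 5)]
loads = link_loads(flows)
hottest_link, hottest_load = loads.most_common(1)[0]
```

Here the link from (1, 0) to (2, 0) carries both flows, so it is reported as the hottest link; a visualization built on such counts could show the programmer where a naive placement concentrates traffic.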
Architecture for Future Process Technology:
The 2005 ITRS report points to the end of CMOS technology scaling within the next ten to fifteen years. Beyond traditional CMOS, several developing device technologies show promise, including carbon nanotubes, ion-trap quantum logic, and optical interconnect and logic. It is currently unclear which of these technologies will prove capable of surpassing CMOS’s performance; however, across many of them reliability promises to be one of the greatest constraints on future microarchitectures. Reliability enhancement techniques such as intra-core microarchitectural redundancy, core redundancy, and duplicate networks with parity and retransmission all show promise. Power consumption will continue to be a constraint in future process technologies. Many of the strategies to address reliability have an impact on power consumption, so computer architects will have to address both simultaneously. The structure of biological systems, such as the human brain, may provide useful inspiration on microarchitectural techniques to deal with extremely unreliable, power hungry substrates. I intend to forge working relationships across disciplinary and institutional boundaries with researchers in fields including device physics, materials, and photonics, to investigate future process technologies as they develop and propose architectures best suited to their characteristics.
The computer industry approaches an inflection point. Previous design strategies have proven unscalable; meanwhile, the underlying process technology of CMOS approaches its end. More than at any time in the past twenty years, the computer industry is looking to academic research to show the way forward. Leveraging my past experience and research, I intend to perform groundbreaking research in computer architecture with the goal of gaining industry acceptance of my work.
References
[1] P. Gratz, B. Grot, and S. W. Keckler. Regional Congestion Awareness for Load Balance in Networks-on-Chip. In The 14th International Symposium on High-Performance Computer Architecture (HPCA) (accepted for publication), February 2008.
[2] P. Gratz, C. Kim, R. McDonald, S. W. Keckler, and D. Burger. Implementation and Evaluation of On-Chip Network Architectures. In The 2006 IEEE International Conference on Computer Design (ICCD), pages 477–484, October 2006.
[3] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W. Keckler, and D. Burger. On-Chip Interconnection Networks of the TRIPS Chip. IEEE Micro, 27(5):41–50, 2007.
[4] P. Gratz, K. Sankaralingam, H. Hanson, P. Shivakumar, R. McDonald, S. W. Keckler, and D. Burger. Implementation and Evaluation of a Dynamically Routed Processor Operand Network. In The First ACM/IEEE International Symposium on Networks-on-Chip (NOCS), pages 7–17, May 2007.
[5] K. Sankaralingam, R. Nagarajan, P. Gratz, R. Desikan, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, W. Yoder, R. McDonald, S. Keckler, and D. Burger. The Distributed Microarchitecture of the TRIPS Prototype Processor. In The 39th ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 480–491, December 2006.