An Analysis of Network Congestion in the Titan Supercomputer’s Interconnect

Jonathan Freed, University of South Carolina
Saurabh Gupta and Devesh Tiwari, Oak Ridge National Laboratory (OLCF)

Acknowledgments
• This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internship program.
• This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
• I would like to thank the University of South Carolina’s Office of Undergraduate Research and Computer Science and Engineering Department, as well as ACM’s SRC, for travel support.

References
• Matt Ezell, “Understanding the Impact of Interconnect Failures on System Operation”, in Cray User Group (CUG) 2013.
• Kevin Pedretti, Courtenay Vaughan, Richard Barrett, Karen Devine, K. Scott Hemmert, “Using the Cray Gemini Performance Counters”, in Cray User Group (CUG) 2013.

Titan’s Interconnect
Titan implements the Cray Gemini interconnect, which has a 3D torus topology. A 3D torus topology is more prone to interference among applications sharing the network. Each node in the 3D torus network is a Gemini router, which hosts two compute nodes in Titan. Each compute node comprises a CPU and a GPU.

Node Neighborhood
We used two different neighborhoods of nodes for our analyses:
• Direct Path
• One-Hop Expanded
The Direct Path spans the intermediate connections (routers) between the two test nodes. The One-Hop Expanded neighborhood also includes the nodes immediately surrounding the direct path in all dimensions. Expanding the neighborhood allows us to investigate the effects of the surrounding area’s status at the time of the throughput test.
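The two neighborhoods could be built roughly as follows. This is only a sketch under stated assumptions: the poster does not say how traffic is routed, so the code assumes dimension-ordered routing (X, then Y, then Z) that takes the shorter way around each ring, and it treats "immediately surrounding" as the six axis-aligned neighbors of each router on the path. The torus size and the function names are hypothetical.

```python
def step_toward(a, b, size):
    """Direction (+1 or -1) of the shorter way from a to b on a ring."""
    forward = (b - a) % size
    return 1 if forward <= size - forward else -1


def direct_path(src, dst, dims):
    """Routers visited from src to dst, both endpoints included.

    Assumption: dimension-ordered routing (X, then Y, then Z).
    """
    path = [tuple(src)]
    cur = list(src)
    for axis in range(3):
        while cur[axis] != dst[axis]:
            step = step_toward(cur[axis], dst[axis], dims[axis])
            cur[axis] = (cur[axis] + step) % dims[axis]
            path.append(tuple(cur))
    return path


def one_hop_expanded(path, dims):
    """Direct path plus every router one hop away from it.

    Assumption: "one hop" means the six axis-aligned torus neighbors.
    """
    hood = set(path)
    for router in path:
        for axis in range(3):
            for delta in (-1, 1):
                neighbor = list(router)
                neighbor[axis] = (neighbor[axis] + delta) % dims[axis]
                hood.add(tuple(neighbor))
    return hood


# Hypothetical 8 x 8 x 8 torus; not Titan's actual dimensions.
dims = (8, 8, 8)
path = direct_path((0, 0, 0), (2, 0, 7), dims)
hood = one_hop_expanded(path, dims)
print(len(path), len(hood))
```

In this example the Z coordinate goes from 0 to 7 in a single hop because the route wraps around the ring instead of crossing seven links.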
Introduction
Large-scale systems like Titan rely on high-speed interconnects to benefit from vast amounts of parallelism. The interconnect is essential for compute nodes to exchange data with each other throughout a computation. If the interconnect becomes congested with data, throughput drops, resulting in slower compute times for researchers. This project explored the factors causing network congestion and investigated which applications are likely to interfere with others.

[Figures: 3D Torus; Gemini Router]

Methods Overview
We collected data by testing the throughput between two nodes. We then examined this data for correlations between throughput and the following three variables:
• number of “busy nodes”
• neighboring applications
• distance between test nodes
For the analyses of the number of “busy nodes” and neighboring applications, we used two node neighborhoods: the Direct Path and the One-Hop Expanded. Each variable is examined in turn: correlation with the number of busy nodes, correlation with applications, and the effect of path distance.

Correlation with Number of Busy Nodes
By analyzing the amount of activity (the number of busy nodes) in the neighborhood, we found correlations with throughput. We quantified these correlations using Pearson’s and Spearman’s correlation coefficients. Our data analysis shows:
• a moderate correlation exists between the number of busy nodes and throughput in all dimensions
• the correlation is stronger in the X and Y dimensions

[Figure panels: Direct Path; One-Hop Expanded]

A high number of busy nodes negatively affects throughput.

Pearson & Spearman Correlations
Neighborhood      Dimension  Pearson Coefficient  Pearson p-value  Spearman Coefficient  Spearman p-value
Direct Path       X          0.421876             0.000000         0.519125              0.000000
Direct Path       Y          0.467197             0.000000         0.713708              0.000000
Direct Path       Z          0.269941             0.000000         0.099382              0.017526
One-Hop Expanded  X          0.421416             0.000000         0.505862              0.000000
One-Hop Expanded  Y          0.404840             0.000000         0.615761              0.000000
One-Hop Expanded  Z          0.384263             0.000000         0.388042              0.000000
A p-value shown as 0.000000 is below the six-decimal precision of the table.

Effect of Path Distance
Because each dimension of the 3D torus wraps around, the longest path between two nodes does not occur at the largest coordinate difference. It occurs at the median distance of each dimension, that is, when the two nodes are about half the dimension’s size apart.
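The wraparound effect on path length can be illustrated with a small sketch. It assumes minimal routing, so that the hop count along one dimension is the shorter of the two ways around the ring. The ring size of 16 and the function names are illustrative assumptions, not values taken from the poster.

```python
def ring_distance(a, b, size):
    """Minimal hop count between coordinates a and b on a ring of `size` routers."""
    d = abs(a - b) % size
    return min(d, size - d)


def torus_hops(src, dst, dims):
    """Minimal hop count between two routers in a torus (sum over dimensions)."""
    return sum(ring_distance(s, t, n) for s, t, n in zip(src, dst, dims))


# Hop count from coordinate 0 to every other coordinate on a 16-router ring.
size = 16  # illustrative ring size
hops_by_offset = [ring_distance(0, offset, size) for offset in range(size)]
print(hops_by_offset)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
```

The hop count peaks at offset 8, half the ring size, and falls again toward offset 15, which is only one hop away through the wraparound link.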
Correlation with Applications
Using the One-Hop Expanded node neighborhood, we were able to identify applications that we suspect cause network congestion.

[Figure panels: Direct Path; One-Hop Expanded]
• Confidence – how often low throughput occurred with a given application in the neighborhood
• Presence – average percentage of nodes in the neighborhood that are occupied by a given application
Note: Application names/users replaced with IDs.

Some neighboring applications could cause low throughput.

Conclusion
Our study shows that network throughput is affected by:
• the number of busy nodes nearby
• certain applications running on nearby nodes
• the path distance between the communicating nodes
We can identify applications that warrant further investigation as possible causes of network congestion.
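The confidence and presence measures used in the application analysis could be computed along the following lines. The poster gives only one-line definitions, so this sketch makes several assumptions: confidence is taken as the fraction of tests with the application in the neighborhood that showed low throughput, presence is averaged over those same tests, and "low throughput" means falling below a chosen threshold. The record layout, the threshold, and the sample values are all made up for illustration.

```python
from collections import defaultdict


def confidence_and_presence(tests, low_threshold):
    """Return {app_id: (confidence, presence_percent)}.

    tests: iterable of (throughput, {app_id: nodes_occupied}, neighborhood_size).
    Assumption: a test counts as "low throughput" when throughput < low_threshold.
    """
    seen = defaultdict(int)           # tests in which the application appears
    low = defaultdict(int)            # of those, tests with low throughput
    share_sum = defaultdict(float)    # running sum of occupancy percentages

    for throughput, app_nodes, hood_size in tests:
        for app_id, count in app_nodes.items():
            seen[app_id] += 1
            share_sum[app_id] += 100.0 * count / hood_size
            if throughput < low_threshold:
                low[app_id] += 1

    return {
        app_id: (low[app_id] / seen[app_id], share_sum[app_id] / seen[app_id])
        for app_id in seen
    }


# Made-up sample: three throughput tests over a 20-node neighborhood.
sample_tests = [
    (1.0, {"app_1": 5, "app_2": 10}, 20),
    (4.0, {"app_1": 10}, 20),
    (0.5, {"app_2": 4}, 20),
]
scores = confidence_and_presence(sample_tests, low_threshold=2.0)
for app_id, (confidence, presence) in sorted(scores.items()):
    print(app_id, confidence, presence)
```

Under these assumptions, an application with high confidence and high presence would be a candidate for closer investigation, while high confidence with very low presence may simply reflect few observations.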