An Analysis Of Network Congestion in the Titan Supercomputer’s …sc15.supercomputing.org/sites/all/themes/SC15images/src... · 2016. 5. 10. · An Analysis Of Network Congestion

Jonathan Freed, University of South Carolina
Saurabh Gupta and Devesh Tiwari, Oak Ridge National Laboratory (OLCF)

Acknowledgments

• This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internship program.

• This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

• I would like to thank the University of South Carolina’s Office of Undergraduate Research and Computer Science and Engineering Department, as well as ACM’s SRC for travel support.

An Analysis Of Network Congestion in the Titan Supercomputer’s Interconnect

Titan implements the Cray Gemini interconnect, which has a 3D torus topology. A 3D torus topology is more prone to interference among applications sharing the network.

Each node in the 3D torus network is a Gemini router, which hosts two compute nodes in Titan. A compute node comprises a CPU and a GPU.

Node Neighborhood

We used two different neighborhoods of nodes for our analyses:

• Direct Path
• One-Hop Expanded

The Direct Path spans the intermediate connections (routers) between the two test nodes. The One-Hop Expanded neighborhood includes the nodes immediately surrounding the direct path in all dimensions.

Expanding the neighborhood allows us to investigate the effects of the surrounding area’s status at the time of the throughput test.
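As a rough illustration of the two neighborhoods, the sketch below builds a direct path and its one-hop expansion on a small torus. The dimension-ordered routing (X, then Y, then Z), the reading of "one hop" as one step along a single axis, the torus size, and all function names are assumptions made for illustration; the poster does not describe how the path between the test nodes was actually determined.

```python
def ring_steps(a, b, n):
    """Coordinates visited going from a to b the short way around a ring of size n."""
    forward = (b - a) % n
    step = 1 if forward <= n - forward else -1
    hops = min(forward, n - forward)
    return [(a + step * i) % n for i in range(1, hops + 1)]


def direct_path(src, dst, dims):
    """Routers on an assumed dimension-ordered path from src to dst, endpoints included."""
    path = [tuple(src)]
    cur = list(src)
    for axis in range(3):
        for coord in ring_steps(cur[axis], dst[axis], dims[axis]):
            cur[axis] = coord
            path.append(tuple(cur))
    return path


def one_hop_expanded(path, dims):
    """The path plus every router one hop away from it along any single axis."""
    region = set(path)
    for node in path:
        for axis in range(3):
            for delta in (-1, 1):
                neighbor = list(node)
                neighbor[axis] = (neighbor[axis] + delta) % dims[axis]
                region.add(tuple(neighbor))
    return region


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical torus size, not Titan's
    path = direct_path((0, 0, 0), (2, 0, 7), dims)
    print(path)
    print(len(one_hop_expanded(path, dims)))
```

Counting busy nodes or resident applications then reduces to looking up each router in the chosen set against the scheduler's placement at the time of the test.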

Correlation with Number of Busy Nodes

From our analysis of the amount of activity (the number of busy nodes in the neighborhood), we found correlations with throughput. We quantified these correlations using Pearson’s and Spearman’s correlation coefficients.

Our data analysis shows:

• a moderate negative correlation exists between the number of busy nodes and throughput in all dimensions
• the correlation is stronger in the X and Y dimensions
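For readers unfamiliar with the two coefficients, here is a minimal pure-Python sketch of how they can be computed. Pearson's coefficient measures linear association, while Spearman's is Pearson's coefficient applied to ranks, so it captures any monotonic relationship. The sample values are invented for illustration and are not the poster's measurements; p-values are not computed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    busy_nodes = [2, 5, 9, 14, 20]           # hypothetical counts
    throughput = [9.1, 8.7, 7.9, 6.0, 5.2]   # hypothetical throughput values
    print(pearson(busy_nodes, throughput))
    print(spearman(busy_nodes, throughput))
```

A negative coefficient in either measure means throughput tends to fall as the number of busy nodes rises.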

[Figure panels: Direct Path; One-Hop Expanded]

Pearson & Spearman Correlations

Neighborhood      Dimension  Pearson coefficient  Pearson p-value  Spearman coefficient  Spearman p-value
Direct Path       X          -0.421876            0.000000         -0.519125             0.000000
Direct Path       Y          -0.467197            0.000000         -0.713708             0.000000
Direct Path       Z          -0.269941            0.000000         -0.099382             0.017526
One-Hop Expanded  X          -0.421416            0.000000         -0.505862             0.000000
One-Hop Expanded  Y          -0.404840            0.000000         -0.615761             0.000000
One-Hop Expanded  Z          -0.384263            0.000000         -0.388042             0.000000

A high number of busy nodes negatively affects throughput.

For the analyses of the number of “busy nodes” and neighboring applications, we used two node neighborhoods: the Direct Path and the One-Hop Expanded.

This project explored the factors causing network congestion and investigated which applications are likely to interfere with others.

Correlation with Applications

Introduction

Methods Overview

Large-scale systems like Titan rely on high-speed interconnects to benefit from massive parallelism. The interconnect is essential for compute nodes to exchange data with each other throughout a computation. If the interconnect becomes congested with data, throughput drops, resulting in longer compute times for researchers.

[Figure labels: 3D Torus; Gemini Router]

We collected data by testing the throughput between two nodes. We then investigated this data to find correlations between throughput and the following three variables:

• number of “busy nodes”
• neighboring applications
• distance between test nodes

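Effect of Path Distance

Because the 3D torus wraps around in every dimension, traffic can take the shorter way around each ring. The maximum-length path therefore occurs at the median distance of each dimension, not at the largest coordinate difference.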
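The wraparound effect can be checked with a few lines of arithmetic. The sketch below is an illustration only: the ring size of 16 is a hypothetical value, and the function names are not from the poster.

```python
def ring_distance(a, b, n):
    """Hops between coordinates a and b on a ring of n routers, going the short way."""
    d = abs(a - b) % n
    return min(d, n - d)


def torus_distance(src, dst, dims):
    """Minimal hop count between two routers in a torus with the given dimensions."""
    return sum(ring_distance(a, b, n) for a, b, n in zip(src, dst, dims))


if __name__ == "__main__":
    n = 16  # hypothetical ring size
    # Distance from coordinate 0 to every other coordinate on the ring.
    print([ring_distance(0, b, n) for b in range(n)])
    print(torus_distance((0, 0, 0), (8, 1, 15), (n, n, n)))
```

The per-coordinate distances rise to a peak at the midpoint of the ring and then fall again, which is why a node at coordinate 15 is only one hop from coordinate 0.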

Conclusion

We can successfully identify applications that warrant further investigation as potential causes of network congestion. Our study shows that network throughput is affected by:

• the number of busy nodes nearby
• certain applications running on nearby nodes
• the path distance between the communicating nodes

Using the One-Hop Expanded node neighborhood, we were able to find applications that we suspect cause network congestion.

[Figure panels: Direct Path; One-Hop Expanded]

Confidence: how often low throughput occurred with a given application in the neighborhood

Presence: the average percentage of nodes in the neighborhood that are occupied by a given application
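One possible reading of these two metrics is sketched below. The poster does not give exact formulas, so several things here are assumptions: confidence is taken as the fraction of tests with the application present that showed low throughput, presence is averaged only over tests where the application appears, and the record layout, the low-throughput threshold, and the application IDs are all hypothetical.

```python
def confidence_and_presence(tests, app_id, low_threshold):
    """Return (confidence, presence) for app_id, or (None, None) if it never appears.

    Each test is a dict with:
      'throughput': measured throughput of the test
      'size':       number of nodes in the neighborhood
      'occupancy':  mapping of application ID -> nodes it occupies in the neighborhood
    """
    with_app = [t for t in tests if t["occupancy"].get(app_id, 0) > 0]
    if not with_app:
        return None, None
    # Assumed definition: share of these tests whose throughput fell below the threshold.
    low = sum(1 for t in with_app if t["throughput"] < low_threshold)
    confidence = low / len(with_app)
    # Assumed definition: mean percentage of neighborhood nodes held by the application.
    shares = [t["occupancy"][app_id] / t["size"] for t in with_app]
    presence = 100.0 * sum(shares) / len(with_app)
    return confidence, presence


if __name__ == "__main__":
    tests = [  # invented records for illustration
        {"throughput": 2.0, "size": 10, "occupancy": {"A1": 5}},
        {"throughput": 8.0, "size": 10, "occupancy": {"A1": 1, "A2": 4}},
        {"throughput": 3.0, "size": 20, "occupancy": {"A2": 10}},
    ]
    print(confidence_and_presence(tests, "A1", low_threshold=4.0))
```

Under this reading, an application with both high confidence and high presence is a stronger suspect than one that merely happens to be nearby in a few slow tests.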

Some neighboring applications could cause low throughput.

Note: Application names and users have been replaced with IDs.