UMASS – UNIVERSITY OF MASSACHUSETTS MEDICAL SCHOOL, USA
Size: Extend the UMass Medical HPC InfiniBand fabric, following the re-cabling projects completed in the last couple of engagements.
Primary SI: DELL
- The UMASS admins were getting a lot of criticism from their customer base.
- The cluster was slow and unreliable.
- One of the key professors and researchers was very critical of the network performance.
- Congestion on the InfiniBand fabric
- Performance issues detected on some nodes
- The admins did not know what to do and were afraid to make any changes.
- Bottleneck issues caused by:
  - Inefficient design
  - Fabric health
Mellanox started with baby steps, making it a priority to rebuild the customer's confidence in InfiniBand while addressing the technical items one by one.
- Ten New EDR switches – “Bring-up” Services
- Add two Ethernet switches
- OS upgrade of four existing switches
- Add one more new leaf IB switch
- Peak throughput increased by 300 %.
- Average network latency reduced by 40 %.
- ‘99 % Latency’ was reduced by 35 %. (99 % Latency = the maximum latency seen by 99 % of network traffic, i.e. the 99th-percentile latency)
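
To make the ‘99 % Latency’ definition concrete, the sketch below computes a 99th-percentile latency from a list of samples and the percentage reduction between two measurements. The source does not say how the metric was actually computed, so the nearest-rank method, the function names, and all sample values here are assumptions for illustration only, not UMass measurements.

```python
import math


def percentile_latency(samples, pct=99.0):
    """Return the smallest sample value that at least pct % of samples do not exceed.

    Uses the nearest-rank method (an assumption; other percentile
    definitions interpolate between neighbouring samples).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: 1-based position of the pct-th percentile.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def percent_reduction(before, after):
    """Return the reduction from before to after, as a percentage of before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100.0 / before


# Hypothetical latency samples in microseconds: 1, 2, ..., 100.
samples_us = list(range(1, 101))
p99_us = percentile_latency(samples_us)  # 99 of the 100 samples are <= this value

# Hypothetical before/after 99th-percentile values, chosen to give a 35 % reduction.
reduction = percent_reduction(20.0, 13.0)
```

With this definition, a single extreme outlier (the slowest 1 % of traffic) does not affect the reported figure, which is why it is often quoted alongside the average latency.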