Training and Education for HPC System Administrators: How Can We Do Better?

Authors: Neelofer Banglawala (Edinburgh Parallel Computing Centre), Bryan Johnston (Centre for High Performance Computing, South Africa), Christopher Harrison (University of Wisconsin, University of Porto)

Abstract: Skilled and successful HPC System Administrators (SysAdmins) are the bedrock of the HPC community, yet well-established pathways and resources for the necessary skill acquisition remain sparse. There is an increasing desire within the community to change this. This session aims to bring together all those interested in HPC SysAdmin training and education to address key issues, such as: What training currently exists? What support do new versus experienced SysAdmins need? How can we share resources and expertise? When one size does not fit all, can we establish formal baseline standards? How can we do better? All are welcome to join the discussion!

Long Description: Successful HPC centres depend crucially on highly trained and effective HPC System Administrators (SysAdmins). Yet well-established pathways and resources for obtaining the skills needed to become a successful HPC SysAdmin are severely lacking. This is in stark contrast to the considerable and widely available training resources for HPC users.

HPC SysAdmin training typically involves shadowing colleagues, home-grown system-specific documentation, support from local sub-communities, vendor-specific training and much trial and error. Further, there are no widely available formal training resources for newcomers to learn how to become successful HPC SysAdmins. HPC SysAdmin teams typically learn through "putting out fires" and making mistakes, which can strain the HPC system, the SysAdmin team and system users. This can lead to suboptimal HPC resource management and poor practices. As HPC expands, new sites require the right skills to administer and develop their HPC resources.

Many experienced HPC SysAdmins do not have the time or opportunity to train outsiders. As a result, local training and support, which is usually uncoordinated and not standardised, is often limited to a small sub-community, e.g. colleagues within the same department. Finding technical support to establish new HPC centres is difficult, whilst experienced SysAdmin teams managing more sophisticated systems need more specialised training.

A one-size-fits-all approach to HPC SysAdmin training is regarded as unrealistic, since different sites have different resources and configurations, and personnel with different backgrounds. However, SysAdmin teams across the HPC community express the desire to share experiences and training resources, learn from each other and establish formal baseline training and qualifications.

This session aims to engage all those interested in HPC SysAdmin training and education on how to improve the current situation by enabling a focused discussion around several key issues, including:

- What training currently exists? How effective is it?
- What are the training needs of HPC SysAdmins, as opposed to Enterprise SysAdmins?
- How do training needs vary by the experience of SysAdmins and the types of systems they manage?
- What help is there for newcomers and those establishing HPC in new communities?
- How can experienced SysAdmins access and share specialised knowledge and skills with less experienced SysAdmins?
- How can we effectively share training resources, expertise and knowledge?
- What would be a standard base set of skills and knowledge?
- What are common pathways to specialisations?
- Could defining career paths for the many different roles under the umbrella term of "HPC SysAdmin" help inform training needs?

The session will conclude by identifying ways in which the community can begin sharing training resources and continue communication on the issues raised, e.g. a central repository and messaging platform. Key individuals and/or groups will be made responsible for ensuring the agreed-upon short-term goals are achieved within a given timeframe. All session output will be captured in several blog posts.

By enabling a community-wide discussion on improving training and education for HPC SysAdmins, we hope to help address an important and much-neglected need within the community.

