You will be an active member of the account team, interacting with the Program Manager, Site Lead, customer, and site staff. You will attend regularly scheduled customer meetings to keep the customer informed of activities and progress and to answer customer inquiries concerning all aspects of the various HPC systems and private and public cloud infrastructures.
The Cloud System Administrator is responsible for direct system administration as well as leading a small team of system administrators in integrating, supporting, and troubleshooting a compute-intensive private cloud environment. Primary system administration duties include maintaining a 300+ node OpenStack environment, KVM virtualization, a multi-petabyte GPFS filesystem, and SLURM workload management. The environment supports a technical workload consisting primarily of analytics and scientific computing in climate science.
Duties and Responsibilities:
- Troubleshoot and resolve in-depth technical problems on Linux
- Direct the work of other system administrators in integration and troubleshooting
- Monitor system performance and maintain high availability of critical company systems
- Test and certify security patches and new software before production deployment
- Provide off-hours support as required to maintain the availability of key systems and services
- Work on special projects as assigned
- Participate in weekly teleconference team meetings and prepare minutes of meetings.
- Deploy and test OpenStack cloud
- Troubleshoot encountered issues and provide solutions.
- Contribute to deployment and administrative repositories by creating Puppet manifests, shell scripts, Python scripts, etc.
- Actively contribute to reviews and documentation
- Design, implement, and customize OpenStack features in Python; fix defects and provide improvements wherever required
- Understand the OpenStack ecosystem, engage in discussions with the OpenStack community, and implement best practices
Education: Bachelor's degree or equivalent, plus 5 years of experience. Master's degree or equivalent.
- 3-5 years of day-to-day operational support experience in a production Linux-based environment that relies entirely on open source software
- Past experience running Sun Solaris Unix is helpful
- Thorough understanding of Layer 2 and Layer 3 networking
- Recent cloud computing experience utilizing AWS (preferred), Google Cloud Platform, MS Azure, or other related platforms, and/or private cloud deployments utilizing OpenStack or similar.
- Experience with Nagios or similar open source monitoring solutions, or the ELK stack
- Required skills include Linux, OpenStack, KVM, GPFS, SLURM, Puppet, and automation on AWS or other public cloud platforms.
- Experience with configuration management tools: CFengine, Puppet, or Ansible
- Strong scripting abilities (preferably Bash or Python; Korn shell, PowerShell, or Ruby also considered)
- Experience with automation technologies covering deployment, configuration, testing, and monitoring
- Experience troubleshooting RPM- or APT-based Linux distributions
- Experience configuring network switches is a plus
- Understanding of shared storage systems, especially NFS
- Understanding of relational database systems, especially MySQL
- Some exposure to LDAP
- Ability to troubleshoot Apache HTTPD virtual host configurations, especially with support for mod_ssl and mod_wsgi
- Experience with Subversion or Git
- Experience in any Unix-based virtualized or container environment: Xen, KVM, VMware, LXC, Docker, Kubernetes
- Demonstrable understanding of tcpdump, strace, netstat, sed, awk, iptables, and ssh
- Experience with web-based development, scientific computing, or numerical analytics is a plus.
- Experience communicating with users, other technical teams, and management to collect requirements and to describe software product features and technical designs.
- Good organizational skills to balance and prioritize work, and the ability to multitask
- Good communication skills for working with support personnel, customers, and managers
- US citizenship or permanent residency