Oracle Principal Site Reliability Engineer - REMOTE in Denver, Colorado
Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Develop designs, architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.
Work with the Site Reliability Engineering (SRE) team on the shared full stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture. Articulate technical characteristics of services and technology areas and guide Development Teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs). Use a deep understanding of service topologies and their dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Bring professional curiosity and a desire to develop a deep understanding of services and technologies.
A BS or MS in Computer Science, or equivalent. Identifies and implements complex solutions using knowledge of server hardware and software configuration, networking, standard internet services, scripting languages, cloud computing patterns, and technology security and compliance. Experience running large-scale customer-facing web services. Identifies and implements complex solutions using an understanding of load balancing technologies and experience with development in programming languages, databases and big data stores, and container technologies. Work involves defining and documenting technical architecture of complex and highly scalable products. A minimum of 8 years of experience running large-scale customer-facing web services.
Oracle is an Affirmative Action-Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, age, or any other characteristic protected by law.
Principal Site Reliability Engineer, Hospitality Cloud - REMOTE
NOTE: We are unable to provide visa sponsorship for this role at this time. No candidates requiring visa sponsorship will be considered.
The Hospitality Cloud SRE team is focused on maximizing service reliability for our hotel product service offerings across global Oracle data centers. Our team runs with a start-up-like approach, leaving room for creative freedom. We have worked to assemble the smartest people in the industry to build and grow this revolutionary and disruptive team.
We are looking to add new members to this dynamic team and are seeking subject matter experts for designing and continuously improving reliability for all components within our solution portfolio while we deconstruct the monolith and move to Oracle Cloud Infrastructure (OCI).
About The Job
As part of the SRE team, you will be continually challenged and directly contribute to the success of our Oracle Hospitality cloud service offerings, every day, working closely with product and Infrastructure partners.
As an SRE, you will solve interesting technical challenges by defining, designing, deploying and troubleshooting key HGBU products, Oracle Cloud services, platforms, and infrastructure, always thinking about reliability, scalability, resilience, security, and performance.
In this role, which is a mix of software development, architecture and operational readiness, you will be responsible for the following:
Service Ownership: You will be part of the SRE team, whose mission is the shared full stack ownership of a collection of services and/or technology areas, with our Core Development partners.
Ownership Scope: As an SRE, you will understand the end-to-end configuration, technical dependencies, and overall behavioural characteristics of the production services you own. In partnership with your Core Development partners, you will be responsible for ensuring that services are designed, delivered and deployed to be mission critical, with a focus on security, resiliency, scale, and performance. SREs are accountable for the end-to-end performance and operability of the services they own.
Service Design: As Oracle Hospitality Cloud Services continually evolve, you will partner with development teams in defining and implementing improvements in service architecture, both current and future. As an SRE, you will be an expert at articulating technical characteristics of your services and the dependencies between services, and will guide Development teams to engineer and add premier capabilities to the Oracle Cloud service portfolio.
Operations Engineering: You will understand and be able to communicate the scale, capacity, security, performance attributes and requirements of the services you own. You will understand and communicate every characteristic of your service stack, such as:
degradation and behaviour under load of the services and their dependencies
end-to-end tuning needs, optimizing resource utilization, as load patterns fluctuate
instrumentation and metrics that clearly describe the service behaviours
scaling requirements and patterns
resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained
Automation: You will have a clear understanding of automation and orchestration principles, and will be eager to automate, wherever and whenever the possibility arises, while simultaneously eliminating technical debt. Automation must be part of your DNA.
Broad Interests: SREs are a rare mix of sysadmins and software development engineers, and as such have the ability to understand and explain the effect of product architecture decisions on the ability to run as distributed systems. They are driven by professional curiosity and a desire to develop a deep understanding of their services and the technologies they depend upon.
Ideal Qualification/ Experience
BS or MS in Computer Science, or equivalent work experience
Minimum 5 years developing software in an enterprise or start-up setting, mainly in Python. Other languages such as PHP considered.
Minimum 5 years Senior DevOps, SRE/SysAdmin experience
Must: Understanding of Cloud Native Technologies and appreciation of Cloud Native Computing Foundation (CNCF) Charter.
Must: Oracle Cloud Infrastructure Certification / experience
Must have: Knowledge of Containers, developing software to work in containers and Container orchestration technologies
Would like: Knowledge of migrating from Monolith to Microservices. Implementing Strangler Application or Anti-corruption layer style programming
Must have: Git experience and Git flow knowledge, and the ability to work in a team with many different developers committing code.
Must have: Used and implemented a full CI/CD pipeline from push to release.
Must have: Knowledge of areas outside your own setting, keeping up to date with technologies and the direction in which the industry is heading.
Knowledge and experience of Observability and Observability enabling tools
Knowledge of secure coding practices, secure software practices and OWASP, and the ability to help other developers adopt practices such as static code analysis.
Must: Analyse software components and recommend modifications that will enhance system reliability, availability and scalability.
Knowledge of networking and security, e.g. certificates, DNS records, load balancers (F5/LBaaS/NGINX), subnets, TLS, SSL, SAML, TCP/IP and Wireshark
Conducting performance testing and tuning to maintain system stability, and offering guidance at the start of a project and throughout it to improve performance
Knowledge of performance monitoring and the use of profilers, APM and Flight Recorders, and the ability to offer guidance on improvement
Knowledge of Agile methods, and SAFe if possible
Experience with automation/configuration management using Terraform, Puppet, Chef or an equivalent
Methodical approach to troubleshooting complex problems
Defining and documenting technical architecture of complex and highly scalable products
Most importantly, the aptitude to be a good team player and the willingness to learn and implement new Cloud technologies
Detailed Description and Job Requirements
Design, develop, troubleshoot and debug software programs for databases, applications, tools, networks etc.
As a member of the SRE division, you will take an active role in the definition and evolution of standard practices and procedures. You will be responsible for defining and developing software for tasks associated with the developing, designing and debugging of software applications or operating systems.
Work is non-routine and very complex, involving the application of advanced technical/business skills in the area of specialization. Leading contributor individually and as a team member, providing direction and mentoring to others. BS or MS degree or equivalent experience relevant to functional area. 5 years of software engineering or related experience.
At Oracle, we don’t just value differences; we celebrate them. We’re committed to creating a workplace where all kinds of people work together. We believe innovation starts with diversity and inclusion.
Job: Product Development
Title: Principal Site Reliability Engineer - REMOTE
Location: United States
Requisition ID: 200015V6