Dell Site Reliability Engineer, Observability in Bedford, Massachusetts
Job Posting Title: Site Reliability Engineer, Observability
RSA creates a wide range of industry-leading products that allow customers to take control of risk, whether those risks stem from external cyber threats, identity and access management challenges, online fraud, compliance pressure or any number of other business and technology issues.
Our customers expect our services to meet all availability and performance SLAs. We are building out expertise in Site Reliability Engineering and expanding our use of DevOps methodologies. As a new role for the global 24/7 SaaS Operations group, this is an exciting opportunity for a seasoned engineer to have a positive impact across all teams and services.
You will work closely with Engineering Architecture, Development, Infrastructure, DBA, Application Support, Security Operations and our NOC. You will ensure that tools provide the required visibility into environments for efficient, effective support, Root Cause Analysis and predictive analytics.
You will be expected to understand operational issues across the full stack. You will also need to understand how to create common processes and systems to cover heterogeneous environments across the cloud and in traditional datacenters.
PRINCIPAL DUTIES AND RESPONSIBILITIES
Research, evaluate, develop, maintain and support the observability tool suite across cloud and data center environments
Partner with development teams to ensure applications are instrumented to provide visibility of performance metrics
Develop automations and integrations for deployment of monitoring tools
Develop and maintain external synthetic monitoring and RUM
Improve root cause identification speed and efficiency
Work cross-functionally to define KPIs used to measure operational efficiency, capacity and availability of environments
Generate internal and customer-facing dashboards and reports required by engineering and product support teams
Support activities that ensure that monitoring infrastructure meets all security and compliance requirements
KNOWLEDGE & SKILLS
Experience integrating cloud monitoring for AWS/Azure (Flow Logs, CloudTrail, CloudWatch, GuardDuty, etc.)
Experienced with DataDog/Dynatrace (or equivalent) for root cause analysis of performance issues, capacity, reliability, and scalability
Experience with additional Open Source monitoring tools preferred (Grafana, Prometheus, ELK, Hobbit etc.)
Experience with web servers and application stacks (Tomcat, JBoss, Nginx, Apache, .NET)
Scripting/coding skills (e.g., Ruby, Python, Java)
Experienced with RUM and external performance monitoring dashboards (Pingdom)
Working knowledge of code pipeline tools advantageous
Working knowledge of Linux, Windows, virtualization stacks, databases, storage and networking devices
Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures
Experience with Infrastructure Monitoring (Solarwinds preferred)
Problem-solving skills and ability to work in a fast-paced, customer-facing, 24/7 production environment
Proven project management skills and technical leadership
Excellent written and verbal communication and documentation skills
Ability to work within a global team and strong work ethic
3+ years’ experience with monitoring applications and infrastructure stacks
Experience with AWS/Azure cloud and traditional datacenters required
Hands-on experience troubleshooting and tuning preferred
10+ years and a BS in CS, IT, or a related field, or equivalent work experience