Operations and administration of the SaaS platform¶
All deployments are automated thanks to our Ansible playbooks.
All our playbooks are versioned, maintained and reviewed by the Toucan Toco tech team.
As a good practice, we never need to connect directly to a Toucan Toco node to run commands manually. This reduces the risk of human error and lets the deployment process auto-scale.
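As an illustration, deployment steps of this kind are typically expressed as declarative, idempotent Ansible tasks. The playbook below is a hypothetical sketch (host group, package and service names are assumptions, not our actual playbooks):

```yaml
# Hypothetical deployment excerpt: every step is declarative, so no one
# ever has to run commands on a node by hand.
- hosts: app_servers
  become: true
  tasks:
    - name: Ensure nginx is installed and up to date
      apt:
        name: nginx
        state: latest
        update_cache: true

    - name: Ensure nginx is running and enabled at boot
      service:
        name: nginx
        state: started
        enabled: true
```

Because every task is idempotent, the same playbook can be replayed safely on one node or a hundred, which is what makes the process auto-scalable.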
We also centralize system logs such as syslog, auth, nginx (access and error) and fail2ban. Our dashboards let us detect brute-force attempts, spam and other malicious behavior. For each detected pattern, we raise automated alerts with Elastalert.
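For illustration, an Elastalert rule for such a pattern looks like the following. The index name, thresholds and alert target below are assumptions, not our production values:

```yaml
# Hypothetical Elastalert rule: alert when one source IP fails SSH
# authentication more than 20 times within 5 minutes.
name: ssh-brute-force
type: frequency
index: syslog-*
num_events: 20
timeframe:
  minutes: 5
filter:
  - query:
      query_string:
        query: 'program: sshd AND message: "Failed password"'
query_key: source_ip
alert:
  - email
email:
  - "ops@example.com"
```

Grouping by `query_key` means each offending IP raises its own alert, so one noisy attacker does not mask another.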
Our log retention policy is 8 weeks in our Elastic stack, but by default we keep 14 weeks of web access/error logs and 52 weeks of application logs on our servers.
As a good practice, we never need to connect directly to a Toucan Toco node to follow activity and logs. This is a key point in our ability to scale our monitoring.
We regularly and automatically scan our servers in search for:
- open ports
- missing security updates
This ensures our environment stays up to date (system, security patches…).
Our monitoring services alert us when:
- a server becomes unresponsive
- a server presents unusual CPU, memory or disk activity
- a server is approaching its hardware limits
- an application status page is not OK
- one of the following ports is not listening: 443, 80 or 22
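The port check above can be sketched in shell. The host and port list here are illustrative; the real monitoring targets production nodes, not localhost:

```shell
#!/bin/sh
# Hypothetical sketch: report every expected port that is not
# accepting connections on a given host.
check_ports() {
  host=$1; shift
  for port in "$@"; do
    # /dev/tcp is a bash feature; timeout bounds a hanging connect.
    if ! timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "ALERT: $host port $port is not listening"
    fi
  done
}

check_ports 127.0.0.1 22 80 443
```

In practice a monitoring service runs an equivalent probe on a schedule and pages the on-call team instead of printing to stdout.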
Furthermore, we use OSSEC to alert us of possible intrusions.
This monitoring runs 24/7, and every alert is acknowledged, which ensures a fast reaction from the Toucan Toco tech team.
Every week these services send us detailed performance and uptime reports.
These regular reports help us identify potential regressions or bottlenecks that need fixing.
Watch and patch management¶
To discover new vulnerabilities and patch against them as quickly as possible, we follow:
| Type | Technology | Watched source |
| --- | --- | --- |
| Database | MongoDB | MongoDB CVE database |
| Database | MongoDB | MongoDB Security Checklist |
| Application | Python | Python CVE database |
| Container | Docker | Docker Dev Mailing list |
| Container | Docker | Docker User Mailing list |
| System | Ubuntu | Ubuntu LTS packages |
| System | Ubuntu | Ubuntu Security List |
| System | Debian | Debian Security List |
We also follow the GitHub issues and announcements of the main projects we use.
As soon as a security patch is available, we automatically apply it to our whole infrastructure using our Ansible scripts.
Otherwise, our infrastructure is fully updated every 2 months with our Ansible playbooks. Before applying updates everywhere, we run them on a staging node to make sure there is no regression.
This update process can very occasionally require a short downtime, which we schedule outside office hours.
If the infrastructure or the applications are impacted by a known vulnerability, we always send a mail report to the client to warn them and explain how we remediated it.
We run a daily backup process for each instance/project.
The backup is a complete snapshot, encrypted with a GPG key dedicated to the instance/project and exported to our dedicated backup nodes.
GPG keys are only available to Toucan Toco admins and are stored in our password manager.
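The encryption step can be sketched as follows. This is a hedged, self-contained illustration: in production a GPG key pair dedicated to the instance/project is used, whereas the sketch falls back to a symmetric passphrase so it can run anywhere. Paths and the passphrase are assumptions:

```shell
#!/bin/sh
# Hypothetical sketch of the backup encryption step.
SNAPSHOT=$(mktemp /tmp/snapshot.XXXXXX)
printf 'demo backup payload' > "$SNAPSHOT"

# Encrypt the snapshot before it leaves the node.
# (Production uses a per-project key pair, not a command-line passphrase.)
gpg --batch --yes --pinentry-mode loopback \
    --symmetric --cipher-algo AES256 \
    --passphrase 'illustrative-only' \
    --output "$SNAPSHOT.gpg" "$SNAPSHOT"

# Only the encrypted .gpg file is exported to the backup nodes.
```

The point of the design is that backup nodes only ever hold ciphertext; compromising the storage tier alone yields nothing readable.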
All the backups are exported to our dedicated storage servers hosted by OVH in another region.
By default we keep a retention of 20 daily backups for each instance/project.
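A 20-day retention of this kind is typically enforced by a small cleanup job. The directory layout and file naming below are assumptions used for illustration:

```shell
#!/bin/sh
# Hypothetical retention job: drop encrypted backups older than 20 days.
BACKUP_DIR=$(mktemp -d)   # stand-in for the real backup directory

# Simulate one stale and one fresh daily backup.
touch -d '25 days ago' "$BACKUP_DIR/instance-old.tar.gz.gpg"
touch "$BACKUP_DIR/instance-new.tar.gz.gpg"

# Keep 20 days of daily backups, delete anything older.
find "$BACKUP_DIR" -name '*.gpg' -type f -mtime +20 -delete

ls "$BACKUP_DIR"
```

Run daily (e.g. from cron), a job like this keeps the backup set at a steady 20 snapshots per instance.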
We also regularly test and challenge our backup and restoration scripts.
Restoring an instance or a project is a fully automated and fast process.
By culture, we keep a logbook of every issue on the infrastructure.
Each logbook entry describes:
- what happened
- how we understood the issue
- what we did to solve the problem
- what the impacts were
- what we need to do to avoid it next time
The logbook is open to every Toucan Toco employee. Knowledge about the life of the infrastructure and its issues is shared and maintained by everyone.
Communication during issues¶
As soon as we detect an issue, your dedicated account manager and/or client success manager will contact you to explain the issue, the potential impacts and give you an estimated resolution time.
When the issue is closed, you can expect a post-mortem report, mainly extracted from our logbook (cf previous paragraph), with details about the investigation and the resolution process.
This emergency communication is available 24x7.
Our main support channels are emails and our Discourse platform.
This support is open between 9:00 and 18:00 (Paris time) on working days.
On-Call duty team¶
Project instance and server decommissioning¶
Each time we need to decommission a project instance:
- the dedicated stack is shut down (virtual hosts, API processes, workers, queue server, database)
- all data, logs and associated configuration are erased
Each time we need to decommission a server:
- data and home partitions are fully formatted
- we force a basic rewrite of the partitions (with a basic dd command), so no block can be restored to its previous state
- then we release the server to Scaleway.
A decommissioned server is always left without any data.
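The overwrite step can be illustrated on a file standing in for a block device (running dd against a real /dev/sdX is destructive, so the sketch below uses a temp file as the target):

```shell
#!/bin/sh
# Hypothetical sketch of the decommission wipe: overwrite every block
# with zeros so nothing can be recovered from its previous state.
DISK=$(mktemp)                                               # stand-in for /dev/sdX
dd if=/dev/urandom of="$DISK" bs=1024 count=64 2>/dev/null   # simulate "old data"

# Basic single-pass rewrite, as described above.
dd if=/dev/zero of="$DISK" bs=1024 count=64 conv=notrunc 2>/dev/null

# Every byte of the target is now zero.
```

On a real device the target would be the raw disk (e.g. /dev/sdX) and dd would run to end of device rather than a fixed count.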
We have exactly the same approach for any SAN or storage volumes.