Server Maintenance Checklist

Servers are amazing things.  They hum along 24/7, usually without issue, but like any machine they do require some maintenance.

Simple maintenance and monitoring can often prevent a server failure from turning into a server disaster. For example, I’ve had people call in a panic because their server has crashed. When we investigate, we discover that their RAID failed last year, their backups stopped three months ago, and their disk reached 100% capacity, corrupting their database.

If you use our managed services, you don’t have to worry about these things.  We monitor, review and maintain things 24/7, but if you are managing your own server, here are twelve items that should be part of your server maintenance checklist.

12 Server Maintenance Tips

1. Verify your backups are working.

Before making any changes to your production system, be sure that your backups are working. You may even want to run some test recoveries if you are going to delete critical data. While focused on backups, you may want to make sure you have selected the right backup location.
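As a rough illustration, here is a minimal Python sketch of a backup freshness check. It assumes your backups land as files in a single directory; the function names and the 24-hour window are hypothetical choices for this example, not part of any backup tool. A recent, non-empty file does not prove the backup can be restored, so this complements test recoveries rather than replacing them.

```python
import os
import time


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the newest non-empty file, or None if there is none."""
    now = time.time() if now is None else now
    newest = None
    for entry in os.scandir(backup_dir):
        # Ignore directories and zero-byte files (a common sign of a failed job).
        if entry.is_file() and entry.stat().st_size > 0:
            mtime = entry.stat().st_mtime
            if newest is None or mtime > newest:
                newest = mtime
    if newest is None:
        return None
    return (now - newest) / 3600.0


def backup_is_fresh(backup_dir, max_age_hours=24):
    """True if at least one non-empty backup file is newer than max_age_hours."""
    age = newest_backup_age_hours(backup_dir)
    return age is not None and age <= max_age_hours
```

You could call backup_is_fresh() from a scheduled job and send an alert whenever it returns False.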

2. Check disk usage.

Don’t use your production system as an archival system. Delete old logs, emails, and software versions that are no longer used. Keeping your system free of old software limits security issues, and a smaller data footprint means faster recovery. If your usage exceeds 90% of disk capacity, either reduce usage or add more storage. If a partition reaches 100%, your server may stop responding, database tables can become corrupted, and data may be lost.
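A minimal sketch of the 90% check, using Python's standard shutil.disk_usage. The function names are hypothetical, and the percentage may differ slightly from what df reports because df accounts for reserved blocks.

```python
import shutil


def percent_used(total, used):
    """Percentage of capacity in use."""
    return 100.0 * used / total


def check_disk(path="/", threshold=90.0):
    """Return (percent used, True if usage exceeds the threshold) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, pct > threshold
```

Run check_disk() once per mounted partition you care about, since each one fills up independently.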

3. Monitor RAID Alarms.

All production servers should use RAID. More importantly, you should be monitoring your RAID status. In more than a decade in business, we have worked on countless systems where a RAID failure went unnoticed. As a result, a later single-disk failure caused a complete system failure. At rackAID, we either use providers that monitor our RAID for us or we have set up direct RAID monitoring. I estimate that RAID fails in roughly 1% of servers per year. One percent may seem small, but a complete server failure can turn a simple drive replacement into a multi-hour disaster recovery scenario.
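If you happen to use Linux software RAID (md), a sketch like the following can flag degraded arrays by reading /proc/mdstat. This is an assumption about your setup: hardware RAID controllers need their vendor's own tools instead. The function name is hypothetical.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member status shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines end with something like "[UU]" (healthy) or "[U_]" (one disk missing).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded
```

On a live system you would pass in the contents of /proc/mdstat, for example open("/proc/mdstat").read(), and alert if the returned list is not empty.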

4. Update Your OS.

Updates for Linux systems are released frequently. Staying on top of these updates can be challenging. This is why we use automated patch management tools and have monitoring in place to alert us when a system is out of date. If you are updating your server manually (or not at all), you may miss important security updates. Hackers often scan for vulnerable systems within hours of an issue being disclosed, so rapid response is key. If you cannot automate your updates, then create a schedule to update your system. I recommend weekly at a minimum for current versions and perhaps monthly for older OS versions. I would also monitor release notices from your distribution so you are aware of any major security threats and can respond quickly.
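The schedule above can be expressed as a small check. This sketch assumes you already record when each server was last updated (for example, from your package manager's history); the function name and the 7-day and 30-day limits are my reading of "weekly" and "monthly", not fixed rules.

```python
from datetime import datetime, timedelta


def update_overdue(last_update, current_release=True, now=None):
    """True if the last update is older than the schedule allows."""
    now = now or datetime.now()
    # Weekly for current OS versions, monthly for older ones.
    limit = timedelta(days=7) if current_release else timedelta(days=30)
    return now - last_update > limit
```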

5. Update your Control Panel.

If you are using a hosting or server control panel, be sure to update it as well. Sometimes this means updating not only the control panel itself, but also software it controls. For example, with WHM/cPanel, you must manually update PHP versions to fix known issues. Simply updating the control panel does not also update the underlying Apache and PHP versions used by your OS.

6. Check application updates.

Web applications account for more than 95% of all security breaches we investigate.  Be sure to update your web applications, especially popular programs like WordPress.

7. Check remote management tools.

If your server is co-located or with a dedicated server provider, you will want to check that your remote management tools work. Remote console, remote reboot and rescue mode are what I call the 3 essential tools for remote server management. You want to know that these will work when you need them.

8. Check for hardware errors.

You may want to review the logs for any signs of hardware problems. Overheating notices, disk read errors, and network failures could be early indicators of potential hardware failure. These are rare but worth a look, especially if the system has not been working within normal ranges.
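As a starting point, a log review can be as simple as filtering for suspicious phrases. The patterns below are illustrative assumptions based on messages commonly seen in Linux kernel logs; they are not exhaustive, and your hardware may log different wording.

```python
import re

# Illustrative phrases only; extend this list for your own hardware and drivers.
HARDWARE_PATTERNS = re.compile(
    r"temperature above threshold|i/o error|machine check|link is down",
    re.IGNORECASE,
)


def hardware_warnings(lines):
    """Return the log lines that match any of the hardware-related patterns."""
    return [line.rstrip("\n") for line in lines if HARDWARE_PATTERNS.search(line)]
```

You could feed it the lines of a kernel log file, or the captured output of a command such as dmesg.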

9. Check server utilization.

Review your server’s disk, CPU, RAM and network utilization. If you are nearing limits, you may need to plan on adding resources to your server or migrating to a new one. If you are not using a performance monitoring tool, you can install sysstat on most Linux servers. This will provide you with some baseline performance data.
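For a quick spot check without extra tools, here is a sketch that reads memory and load figures on Linux. It assumes /proc/meminfo includes a MemAvailable field, which older kernels do not provide; the function names are hypothetical.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Percentage of RAM not available for new workloads."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def load_per_cpu():
    """One-minute load average divided by the number of CPUs (Unix only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)
```

A load per CPU that stays above 1.0, or memory use that keeps climbing, is a hint that it is time to plan for more resources.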

10. Review user accounts.

If you have had staff changes, client cancellations or other user changes, you will want to remove these users from your system. Storing old sites and users is both a security and legal risk. Depending on your service contracts, you may not have the right to retain a client’s data after they have terminated services.
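To see which accounts are worth reviewing, you can list users that are able to log in. This sketch parses text in the /etc/passwd format. The minimum UID of 1000 for regular users is an assumption that holds on many, but not all, distributions, and the function name is hypothetical.

```python
NON_LOGIN_SHELLS = ("nologin", "false")


def login_accounts(passwd_text, min_uid=1000):
    """Return names of regular accounts that have a usable login shell."""
    accounts = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        name, uid, shell = fields[0], int(fields[2]), fields[6]
        if uid >= min_uid and not shell.endswith(NON_LOGIN_SHELLS):
            accounts.append(name)
    return accounts
```

Compare the result against your current staff and client list; anything left over is a candidate for removal.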

11. Change passwords.

I recommend changing passwords every 6 to 12 months, especially if you have given out passwords to others for maintenance.
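On Linux, the third field of each /etc/shadow entry records the day of the last password change, counted in days since January 1, 1970. A sketch like this can list passwords older than your chosen limit. Reading /etc/shadow requires root, and the function name and the 365-day default are assumptions for this example.

```python
from datetime import date


def stale_passwords(shadow_text, max_age_days=365, today=None):
    """Return (user, age in days) for passwords older than max_age_days."""
    today_days = ((today or date.today()) - date(1970, 1, 1)).days
    stale = []
    for line in shadow_text.splitlines():
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # no recorded change date
        if fields[1].startswith(("!", "*")):
            continue  # locked account or no usable password
        age = today_days - int(fields[2])
        if age > max_age_days:
            stale.append((fields[0], age))
    return stale
```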

12. Check system security.

I suggest a periodic review of your server’s security using a remote auditing tool such as Nessus. Regular security audits serve as a check on system configuration, OS updates and other potential security risks. I suggest this at least 4 times a year and preferably monthly. Also, you may want to revisit the 10 immutable laws of security administration.

Being Proactive Prevents Failures

As part of our management services, we monitor over a dozen server health metrics. By keeping track of things like swap usage, loads, mail queue depth and more, our team of sysadmins often spots issues before they become failures.

Should failures occur, our team can focus on fixing the issue rather than worrying about maintenance items.  This allows us to resolve most service outages in minutes.  We don’t have to stop and apply six months of OS updates to see if a known bug is the issue.

We highly recommend automating server management and maintenance. If you cannot automate, then create a schedule and stick to it. When we first started in this business over 10 years ago, we did a lot of things manually. That works well for a few servers, but once you have dozens of systems to manage, you can miss things. Tools like Nagios, New Relic, Pingdom, sysstat and many other open source and SaaS products can help you keep tabs on your servers.