Disaster Recovery

Planning for the business continuity of Daemen College in the aftermath of a disaster is a complex task. Preparation for, response to, and recovery from a disaster affecting the administrative functions of the College require the cooperative efforts of many support organizations in partnership with the functional areas supporting the “business” of Daemen.

Purpose
Daemen increasingly depends on computer-supported information processing and telecommunications. This dependency will continue to grow with the trend of utilizing information technology within the individual units of Daemen's administration and throughout the campus.

The increasing dependency on computers and telecommunications for operational support poses the risk that a lengthy loss of these capabilities could seriously affect the overall performance of the college.

Daemen’s administration recognizes the low probability of severe damage to the data processing, telecommunications, or support services capabilities that support the college. Nevertheless, because of the potential impact to Daemen, a plan for reducing the risk of damage from a disaster, however unlikely, is vital. The college’s plan is designed to reduce the risk to an acceptable level by ensuring the restoration of critical processing first, followed by all non-essential production systems.

The plan identifies the critical functions of Daemen and the resources required to support them. The plan provides guidelines for ensuring that needed personnel and resources are available for both disaster preparation and response and that the proper steps will be carried out to ensure the timely restoration of services.

Assumptions
The Plan is predicated on the validity of the following assumption: that the situation causing the disaster is localized to the main data processing facilities of IT or to the communication systems and networks that support the functional area, and is not a general disaster, such as an earthquake or the “Blizzard of ’77,” affecting a major portion of the area.

It should be noted, however, that the Plan will still be functional and effective even in an area-wide disaster. Even though the basic priorities for restoration of essential services to the community will normally take precedence over the recovery of an individual department, the plan can still provide for a more expeditious restoration of our resources for supporting key functions.

Disaster Response
This section describes four required responses to a disaster, or to a problem that could evolve into a disaster:

  1. Detect and determine a disaster condition
  2. Notify persons responsible for the recovery
  3. Disseminate Public Information
  4. Initiate the Recovery Plan

Disaster Detection and Determination
Detection of an event that could result in a disaster affecting information processing systems at Daemen is the responsibility of the Maintenance Department, Campus Safety, Information Technology, or whoever first discovers or receives information about an emergency developing in an area housing major information processing systems or affecting the communications lines between these buildings. The first responder will assume the role of incident commander.

Disaster Notification
Maintenance and/or Campus Safety will follow existing procedures and notify the individuals who are responsible for recovery from the disaster. In most cases, this will be the Chief Information Officer and the Director of Information Management. These people will then notify additional staff members as needed.

Dissemination of Public Information
Depending on the type of disaster, the Chief Information Officer will communicate to the campus community the extent of the problem and the steps being taken to remedy it. Communication modes may include email, web and social media announcements, telephone messages, and potentially print materials if electronic communication is unavailable.

This may include the use of the college’s emergency contact information system. This system is outsourced via Regroup, so it should remain available even during regional emergencies in WNY.

Disaster Recovery
The disaster recovery strategy explained below pertains specifically to a disaster disabling the main data centers. This functional area provides major server support to Daemen’s administrative applications. Especially at risk are the critical applications required to support the college’s administrative computing system (Ellucian).

There are three phases to the plan: the Emergency Phase, Damage Assessment, and the Recovery Phase.

Emergency Phase
The emergency phase begins with the initial response to a disaster. During this phase, the existing emergency plans and procedures of the Campus Safety and Maintenance Department direct efforts to protect life and property, the primary goal of the initial response. Security over the area is established as local support services such as the Police and Fire Departments are enlisted through existing mechanisms.

If the emergency situation appears to affect the main data center (or other critical facility or service), either through damage to data processing or support facilities, or if access to the facility is prohibited, the staff member in charge will closely monitor the event, notifying other personnel as required to assist in damage assessment. Once access to the facility is permitted, an assessment of the damage is made to determine the estimated length of the outage. If access to the facility is precluded, then the estimate includes the time until the effect of the disaster on the facility can be evaluated.

Damage Assessment
During this phase, a quick assessment of the damage to all priority systems will occur. The main goal is to determine whether the equipment has been damaged enough to require replacement or whether it can be repaired with readily available parts.

In addition, we also need to determine the state of the environmental systems that support these servers. If the air conditioning systems are not functioning, the servers may not be able to be restarted (depending on the time of year and the outside temperature).

Recovery Phase
The time required for recovery of the functional area and the eventual restoration of normal processing depends on the damage caused by the disaster. The time frame for recovery can vary from several hours to several weeks for a full recovery. The primary goal is to restore normal operations as soon as possible.

If the systems are damaged badly enough to require replacement, an emergency call will be placed to our Dell sales representative. All servers are covered under yearly support contracts with either Dell or Park Place Technologies.

The systems can be classified into three main levels, which are summarized for illustration in the sketch following the descriptions below.

Priority
These are the servers required for day-to-day processing by the college’s administrative offices. Included here are the main servers required for minimal Ellucian functionality and the main administrative file server (though this is at a slightly lower level). Critical servers include the main Active Directory server, the Colleague SQL and application servers, WebAdvisor and WebUI, the Dell EqualLogic storage arrays, MyDaemen, and the VMware host(s). Any networking equipment required to support campus communications, including the edge router, firewall, and Aruba controller, is also a priority.

Required
These systems are required for the full functioning of the administration but can be repaired after the main systems have been restored. Included here are secondary Ellucian systems such as the Self-Service servers used for online payments, the College’s external-facing website, and the College’s MyDaemen portal.

Auxiliary
The remaining systems are either used intermittently or are of much lower priority. They include the Conference/Events server, the file server (formerly Academic Computing), the Laserfiche servers, PaperCut print management, and Symantec Endpoint Protection.
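
For illustration only, the classification above can be expressed as a simple, machine-readable checklist that lists systems in the order they would be restored. The sketch below uses the server names mentioned in this section and is not an authoritative inventory.

    # Illustrative sketch only: the recovery tiers from this plan expressed as
    # data, with a helper that yields systems in restoration order. Names are
    # taken from the descriptions above; this is not an authoritative inventory.

    RECOVERY_TIERS = {
        "Priority": [
            "Active Directory server",
            "Colleague SQL server",
            "Colleague application servers",
            "WebAdvisor / WebUI",
            "Dell EqualLogic storage arrays",
            "MyDaemen",
            "VMware host(s)",
            "Edge router / firewall / Aruba controller",
        ],
        "Required": [
            "Ellucian Self-Service (online payments)",
            "External-facing website",
            "MyDaemen portal",
        ],
        "Auxiliary": [
            "Conference/Events server",
            "File server (formerly Academic Computing)",
            "Laserfiche servers",
            "PaperCut print management",
            "Symantec Endpoint Protection",
        ],
    }

    def restoration_order():
        """Yield (tier, system) pairs in the order they should be restored."""
        for tier in ("Priority", "Required", "Auxiliary"):
            for system in RECOVERY_TIERS[tier]:
                yield tier, system

    if __name__ == "__main__":
        for tier, system in restoration_order():
            print(f"[{tier}] {system}")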

Recovery Process
The recovery process for servers is generally a two-step process, though it can differ based on the amount of damage. The steps detailed below presume that the server has been damaged enough that a full recovery is required.

Step one is to configure the hardware of the new server so that it is as similar as possible to the old server. Once the hardware is prepared, the operating system is installed on the new server, and then patches published by the manufacturer are downloaded and installed.

Once the operating system has been installed, the restoration of the data begins (step two).
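
For illustration, the sketch below shows how step two might be carried out for a physical Windows server protected by Windows Server Backup: locate the most recent backup version on the backup target and start a volume-level restore with the wbadmin command-line tool. The backup share, drive letters, and exact wbadmin flags shown are assumptions and would need to be confirmed against the version of Windows Server in use.

    # Illustrative sketch only: locating the latest Windows Server Backup version
    # and starting a volume restore via wbadmin. The share path, drive letter,
    # and exact wbadmin flags are assumptions; verify them before real use.
    import subprocess

    BACKUP_TARGET = r"\\hnas\serverbackups"  # hypothetical backup share

    def latest_backup_version(target: str) -> str:
        """Parse 'wbadmin get versions' output and return the newest version id."""
        out = subprocess.run(
            ["wbadmin", "get", "versions", f"-backupTarget:{target}"],
            capture_output=True, text=True, check=True,
        ).stdout
        versions = [line.split(":", 1)[1].strip()
                    for line in out.splitlines()
                    if line.lower().startswith("version identifier")]
        if not versions:
            raise RuntimeError("no backup versions found on target")
        return versions[-1]  # assumes wbadmin lists versions oldest-first

    def restore_volume(version: str, volume: str = "D:") -> None:
        """Start a volume-level recovery of the data volume from the given version."""
        subprocess.run(
            ["wbadmin", "start", "recovery",
             f"-version:{version}",
             "-itemType:Volume",
             f"-items:{volume}",
             f"-backupTarget:{BACKUP_TARGET}",
             "-quiet"],
            check=True,
        )

    if __name__ == "__main__":
        version = latest_backup_version(BACKUP_TARGET)
        print(f"Restoring data volume from backup version {version}")
        restore_volume(version)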

Critical servers are fully backed up every night onto separate server storage. On the last Wednesday of each month, backups are enclosed in a locked metal case and taken to Iron Mountain, a secure storage facility in Rochester.
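
For illustration, the “last Wednesday of the month” rotation date mentioned above can be computed with a few lines of Python; this is a convenience sketch only, based on the schedule stated in this paragraph.

    # Illustrative sketch only: compute the monthly offsite rotation date
    # (the last Wednesday of a given month) from the schedule stated above.
    import calendar
    from datetime import date, timedelta

    def last_wednesday(year: int, month: int) -> date:
        """Return the date of the last Wednesday of the given month."""
        days_in_month = calendar.monthrange(year, month)[1]
        month_end = date(year, month, days_in_month)
        # Step back from the month's last day to the most recent Wednesday.
        offset = (month_end.weekday() - calendar.WEDNESDAY) % 7
        return month_end - timedelta(days=offset)

    if __name__ == "__main__":
        today = date.today()
        print("Offsite rotation date this month:",
              last_wednesday(today.year, today.month))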

i. Server Data Backup Plans

  1. Backup Locations:
    • HNAS (Canavan Server Room)
    • OffSiteDataSync (10 TB of space)
  2. Backup Schedule:
    • Nightly backups on all primary production servers
    • Daily/Nightly/Monthly – Virtual Machines
  3. Backup Media:
    • HNAS
  4. Backup Encryption:
    • Veeam Backups encrypted both locally and when synced to the cloud.
  5. Backup Software:
    • Windows Server Backup (Physical Servers)
    • Veeam Agent (Windows Servers – 10 agents)
    • Veeam Backup & Replication (Virtual Machines)
  6. Retention Policy:
    • A minimum of 21 days of backups is retained on all production servers (a simple retention check is sketched after this list)
    • Weekly & Monthly system images
    • Hourly, Daily, and Weekly Snapshots on Hitachi (HNAS only)
  7. Warranty Information:
    • Dell 24x7x4 Pro Support
    • Hitachi 24×7 with next business day parts
    • Park Place Technologies supports out of warranty servers with 24x7x4 or next business day service depending on the server.
    • Current backups total 30 TB
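
For illustration, the sketch below shows one way the 21-day retention target in the Retention Policy above could be spot-checked: compare the dates embedded in each server’s nightly backup files against the required window. The directory layout and file-naming convention are hypothetical, not a description of the actual HNAS or Veeam layout.

    # Illustrative sketch only: check that nightly backups cover the required
    # 21-day retention window. The directory layout and date-stamped file names
    # are hypothetical; adapt the globbing to the actual HNAS/Veeam layout.
    from datetime import date, timedelta
    from pathlib import Path
    import re

    RETENTION_DAYS = 21
    BACKUP_ROOT = Path(r"\\hnas\backups")  # hypothetical backup root

    DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # e.g. server01_2024-05-01.vbk

    def backup_dates(server_dir: Path) -> set:
        """Collect the dates embedded in this server's backup file names."""
        found = set()
        for f in server_dir.glob("*"):
            m = DATE_RE.search(f.name)
            if m:
                found.add(date(int(m.group(1)), int(m.group(2)), int(m.group(3))))
        return found

    def missing_nights(server_dir: Path, days: int = RETENTION_DAYS) -> list:
        """Return the nights within the retention window that have no backup."""
        have = backup_dates(server_dir)
        window = [date.today() - timedelta(days=n) for n in range(1, days + 1)]
        return [d for d in window if d not in have]

    if __name__ == "__main__":
        for server_dir in sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir()):
            gaps = missing_nights(server_dir)
            status = "OK" if not gaps else f"MISSING {len(gaps)} nights"
            print(f"{server_dir.name}: {status}")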

ii. Network Redundancy/Disaster/Support Information

  • Redundant Internet Service:
    We have two fiber pathways connecting the campus to the Internet. One pathway is connected to UB Fiber and the other to Crown Castle. We currently have a 1Gbps Cogent commodity connection running on the UB fiber and a 1Gbps Level3 commodity connection running on the Crown Castle fiber. Diversified paths have been configured on our core edge router to utilize both connections under normal circumstances. Automatic failover has been configured on the core edge router to divert all traffic to one pathway when and if the other fails (a simple reachability-check sketch appears at the end of this section).
  1. Enhanced Support:
    • Edge Router, Firewall, Switches
    • 24×7 phone support for all listed devices
    • 4-hour hardware replacement/repair on the main 8164/8132 switches used at the network edge.
    • 4-hour hardware replacement/repair on the Cisco ASR1002x.
    • E-CLASS support through SonicWall: provides overnight hardware replacement/repair.
  2. Dynamic DNS Failover:
    • DNS is a background process that resolves the user-friendly names typed into web browsers (e.g., my.daemen.edu) to IP addresses (e.g., 198.22.176.4).
    • If the campus Internet connection goes down or the website server malfunctions, our website would normally appear to be down.
    • Our failover server will detect that our Internet connection or website is unreachable and automatically route users to a backup page explaining the particular situation causing the outage. To the end user, this is a seamless process.
  3. Configuration, Logs, and Flash Backups
    • Running configurations and flash backups are taken before and after firmware updates, configuration changes, and other settings changes on network devices and monitoring systems.
    • AirWave and Dell OpenManage are virtual machines that are backed up nightly by our system administrator.
  4. Aruba Wireless Controller (Aruba7205)
    • We have 24×7 phone support on all Aruba products that we own.
    • We have a master/local controller configuration using the LMS option to provide redundancy to access points on our Academic Network. We now have a third 7205 controller for the residential network APs.
  5. DHCP Servers
    • We have three DHCP servers currently in place: Administrative, Academic & Residential
    • Windows Server 2016
  6. Campus Fiber:
    • We assume the risk if campus fiber is damaged.
    • Link aggregation is used on most inter-building connections, providing low-level redundancy in case of port failure, light loss, optics failure, etc.
    • Dark fiber (OM1) exists in separate underground conduits.
    • Three pairs of dark fiber (OM1) provide alternative paths between WICK and SCHENCK. This provides potential redundancy if the direct path from CANAVAN to DUNS SCOTUS or from CANAVAN to SCHENCK is damaged.
    • Through the use of newer layer 3 switches, we plan to turn these “cold standbys” into “hot standbys” utilizing more advanced routing protocols.
    • Cold spares for some switch and access point equipment are kept on hand.
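
For illustration, the sketch below shows the kind of reachability check that underlies both the dual-uplink failover and the dynamic DNS failover described above: probe a host on each uplink and the public website over TCP port 443 and report which are reachable. The gateway addresses are placeholders; the actual failover decisions are made by the edge router and the DNS failover service, not by a script like this.

    # Illustrative sketch only: a basic reachability check for the two Internet
    # uplinks and the public website. Gateway IPs are placeholders; the real
    # failover is handled by the edge router and the DNS failover service.
    import socket

    # Hypothetical probe targets: one host reachable via each uplink,
    # plus the public website itself.
    PROBES = {
        "UB Fiber / Cogent uplink": ("203.0.113.1", 443),       # placeholder address
        "Crown Castle / Level3 uplink": ("198.51.100.1", 443),  # placeholder address
        "Public website (www.daemen.edu)": ("www.daemen.edu", 443),
    }

    def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for name, (host, port) in PROBES.items():
            status = "UP" if reachable(host, port) else "DOWN"
            print(f"{name}: {status}")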