improving web server reliability
A Pragmatic Guide
December 2002
introduction
This paper describes some of the issues
affecting reliability and performance of
applications served through the Internet environment.
It highlights typical sources of unreliability
and performance problems and shows pragmatic ways of dealing with these to
achieve improvements in end-user experience
without requiring disproportionate investment.
the challenge
Serving applications across the Internet uses a long chain of systems and services between
the server and ultimate end-user. The total
latency in serving the application is the sum of the delays on each link in the chain, and
every link adds its own failure modes to the overall probability of failure.
In most server architectures, the failure of
any one system or service in the path between
server and user will in effect cause failure of the
entire application as far as the user is concerned.
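As a rough illustration of how a serial chain behaves, here is a minimal Python sketch. It assumes that each link fails independently of the others, and the link names and figures are invented for illustration rather than measured.

```python
from math import prod

def chain_availability(availabilities):
    """Availability of a serial chain: every link must be up at the same time."""
    return prod(availabilities)

def chain_latency_ms(latencies_ms):
    """End-to-end latency of a serial chain: the individual delays add up."""
    return sum(latencies_ms)

# Hypothetical links: (availability, typical delay in milliseconds).
links = {
    "user access line": (0.995, 40.0),
    "provider backbone": (0.999, 25.0),
    "hosting network": (0.999, 2.0),
    "web server": (0.998, 15.0),
}

availability = chain_availability(a for a, _ in links.values())
latency = chain_latency_ms(ms for _, ms in links.values())
print(f"end-to-end availability: {availability:.4f}")
print(f"end-to-end latency: {latency:.0f} ms")
```

Under these assumptions the chain as a whole is always less available than its weakest link, which is why each link needs attention in turn.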
Managing this whole complex system to provide the desired level of service to an end user
can therefore seem a complex task. These issues
are further compounded because many parts of the system in the Internet as a whole are
seemingly out of the direct control of the
organisation providing the service.
web server issues
Although we simply use the term
"web-server", this can cover a wide range of
technologies, ranging from a simple HTTP server which
holds only static content (.html files and images)
and serves these up across the Internet or an
intranet, through to complex interacting systems of
front-end web-servers, application middleware
servers and database systems.
Each one has its own failure modes and sources of delay and unreliability. To understand
how to improve reliability and performance of the whole system we need to understand how
each component behaves.
Static Web Server
This is the most basic part of the service,
simply serving up files from a local disk in response
to incoming HTTP web requests.
Performance
This whole process is so simple that really
quite high request and data rates can be achieved with
a single relatively low powered PC based server. In our own lab testing, we have found it
possible to achieve rates of very nearly 1000 requests
per second, and data rates of up to 100Mb/s from a sub 1GHz PC based system running an
Apache web-server on an Open Source operating
system without any special tuning or optimisation.
This kind of performance is dependent on
factors such as the number of distinct files being requested and therefore the likelihood that
pages are being served from the filesystem cache
rather than needing one or more physical disk
accesses for each page.
Where a small number of distinct pages are frequently being served, for example a
single static web site with only a few Mb of
commonly accessed pages and images, the limiting
factors on throughput will be the CPU, memory,
network adapter and bus bandwidth. These are
typically large numbers relative to an average WAN
Internet connection, and only the most popular of
well configured small static web-sites will
experience performance problems related to server capacity.
Only where a large number of distinct static
pages are being concurrently served does
filesystem performance become the main limiting
factor. Examples of applications where this may
become an issue are shared servers hosting very
many distinct sites each with a fairly low hit rate, but a
very high aggregate hit rate being served from
all over the filesystem.
Reliability
With a static web-site, the basic factors
affecting reliability will be:
- Internet connectivity to the servers
- Stability of the server hardware systems
- Stability of the server software systems
- Environmental and power stability
within the hosting facility.
Each of the above needs to be considered when attempting to assure or improve the
overall reliability and resiliency of the service.
Active Content
Few web facilities consist of purely static web content. True web-applications which
interact with the user to receive and generate
information dynamically are now part of most
organisations' Internet strategy.
These sometimes quite complex configurations bring a far greater range of performance
and reliability challenges than the simple static
web server case.
Instead of content being immutably served directly from files, and out across the Internet,
the information is being generated by executing
code within the web-server. Often the code relies
on other internal database systems and quite frequently orchestrates the execution of
other code fragments within internal systems or middleware to deliver the final content to the
end user.
The performance and reliability issues in
active content serving systems are a superset of
those discussed earlier for a static web-server.
They are however typically much more challenging.
In particular, the reliability of the entire
system is degraded by the risk of failure of each of
the components individually. Thus a very complex system consisting of front-end scripting
engines, active middleware components (e.g. Java classes), and back-end database systems
has many failure modes that need to be considered in its design.
Performance is often much more of an issue in
active content systems too. The number of CPU
cycles required to deliver a moderately complex scripted page, or results of a Java
run-time execution, can be many orders of
magnitude greater than that required to serve a static
web page.
improving performance and reliability
In summary, the following factors affect the
reliability and performance of a typical active web-site:
- Reliability and capacity of network connectivity and on-site
network infrastructure
- Stability of each separate
hardware function within the implementation architecture
- The sum of all failure modes in all of
the software subsystems within the implementation.
- Environmental and power stability
within the hosting facility.
Network Connectivity
Reliability of any network connection is a
function of the provider(s) used for that connection,
any resiliency or redundancy arrangements to
multiple providers, and network architecture from
the network ingress points through to the server facility.
Providers
Choice of underlying provider can obviously be
a large part of how reliable the Internet
connectivity into a facility proves to be. Whilst it is not
always easy to vet and compare
suppliers in this area in a way which gives
any assurance that service will be free of unreliability or lengthy interruptions, there
are a number of factors which help to understand how their business operates:
- Service Level Agreement: does the supplier's standard contract include
a contractual penalty for downtime or poor service?
- Typical reliability of their own
web-site and services.
- References from other customers
who have used them for a long period of time.
- Open information about backbone link utilisation levels or policy.
- Details of their network map and peering arrangements with
other providers.
SLA
Whilst a Service Level Agreement may mean that the provider has a focus on reliability
of your connection, it needs very careful evaluation to determine if it is
actually meaningful or realistic (most are not!).
Factors to look at in an SLA include:
Is sufficient compensation payable on a
breach that you would be properly compensated and
the supplier given strong incentives to engineer
their network to avoid any breach of SLA?
An SLA which doesn't have any significant penalties is pretty worthless.
Do the actual SLA terms match your
expectations for the service?
Remember that a 99.5% monthly reliability sounds good, but means your service
could be unavailable for over 3.5 hours, perhaps during a business day, each and every
month, and the supplier would still be comfortably within the terms of the SLA.
Are there blanket exclusions for downtime
caused by third party failures?
Most service providers use numerous third parties to provide parts of their
actual underlying physical network infrastructure.
A high reliability service provider will provision its circuits such that it isn't reliant on any
one of these providers, and a single failure will not disable large parts of its network.
Other providers may choose to use a single
provider themselves with little or no redundancy
to save cost. In the latter case a customer
making a claim against an SLA due to one of these outages may be told that this is excluded
by a "third party" clause in the SLA.
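The 99.5% figure mentioned above can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the function name is our own and a 30-day month is assumed.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of outage that still meet an availability target over a period."""
    return period_hours * (1 - availability_percent / 100.0)

HOURS_IN_MONTH = 30 * 24  # assumption: a 30-day month

for target in (99.5, 99.9, 99.99):
    hours = allowed_downtime_hours(target, HOURS_IN_MONTH)
    print(f"{target}% allows {hours:.2f} hours of downtime per month")
```

Running the same calculation against the figure in a proposed contract is a quick way to see whether the SLA terms match what the business actually expects.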
Reliability of their own services
Often overlooked, but a good measure of a telecoms provider's competence
and reliability can be their own web service. If
the web site of a service provider is frequently very slow or unavailable then this
doesn't exactly inspire confidence in their ability
to manage a resilient network.
On the other hand it may mean that their network is fine, they just haven't read
papers like this to work out how to make their web servers more reliable.
Backbone utilisation
One of the factors affecting reliability and
performance of a service provider will be the degree to which
it invests in its infrastructure to ensure that
backbone capacity always exceeds demand by a given
margin. Because of the heavy investment over
the last few years, most international and
intercontinental links have a large over-provision of physical
optical fibre strands, but it is still an expensive exercise to
"light" these and provide additional network capacity.
The degree of "headroom" on backbone links shows
the degree to which the provider can deal with peaks
in demand and anomalous conditions (e.g. link
failure leading to traffic re-routing) without the
service degrading.
Some network providers make their live backbone utilisation graphs public, others regard these as a
trade secret (one can sympathise with that argument as
they also allow competitors to judge their customer
base and asset utilisation).
Using Multiple Providers
No Internet Service Provider will give 100%
reliability at all times, and even the largest have had
widespread multi-hour, and in some cases multi-day, outages
at some stage in their history.
One way of insulating service provision from
supplier unreliability is to take service from several
suppliers at the same time. Properly configured and
managed, this can provide significant reliability benefits.
BGP Multi-homing
The traditional way of acquiring several paths to
the Internet for resilience is to buy multi-homed
access from more than one Internet provider. As a
customer you apply directly to a Regional Internet Registry for
a dedicated block of IP addresses which belong to
you rather than being assigned by either of your
providers. You also apply for an Autonomous System
Number (ASN) which in conjunction with the Border
Gateway Protocol (BGP) allows you to announce your
address space to the Internet via any number of providers.
BGP takes care of making sure that you stay reachable
if either your link to one of your providers, or indeed the entire provider fails.
BGP multi-homing seems like an ideal solution to the problem of
maintaining connectivity in the face of provider
failure. In effect it makes your data centre part
of the core of the Internet using exactly the same mechanisms that ISPs use in order
to route packets between themselves.
Unfortunately BGP, and Internet exterior routing technology generally, was
intended to allow efficient routing of traffic for relatively small numbers of big blocks
of Internet addresses by ISPs.
All but the largest of companies will have difficulty justifying the large minimum sized blocks of
Internet addresses which can be allocated to an organisation,
and injected via BGP into the global Internet routing tables.
There is also an additional cost to operating reliably in a
multi-homed BGP environment: setup and maintenance of
BGP-speaking routers in a truly default-free environment is
complex and expensive. The global routing table used in BGP is
currently around 100,000 routes. This means that relatively
expensive "carrier grade" IP routers with sufficient memory and CPU
to reliably operate in this environment are required.
A multi-homed infrastructure also requires expert
24x7 operations cover in order to function reliably. BGP routing
is complex and has failure modes which can be triggered
by events and changes in configuration on the wider Internet.
For example from time to time incidents can occur where
large numbers of additional routes are accidentally injected
into parts of the Internet. If affected by one of these, it is
possible for routers to run out of memory or CPU bandwidth and
become non-operational, or "flap" and cause other providers to
drop updates to your routes from their table. Unless you have
expert 24x7 cover to detect and remedy these sorts of problems
then it is quite likely that multi-homing will actually introduce
more unreliability through these additional failure modes than
it remedies through the possibility of surviving a single provider
failure.
Other multi-homing solutions
If BGP multi-homing is problematic for many
organisations, what other options exist for removing reliance on a
single provider?
Multiple external connections with IP addresses from
each provider can be used to provide many of the benefits of
multi-homing without the cost, complexity and extra failure
modes of maintaining BGP routing.
It works by buying Internet service from several suppliers,
and getting a small IP address range assigned by each. On the
end of each provider circuit is a set of web servers and at least
one DNS server on addresses assigned by that provider.
Each of the DNS servers is listed as an authoritative server
for the zone of the web-servers, and returns answers with a
relatively short time to live. Instead of each holding identical copies
of the zone file as is normal, each is arranged so that it only
returns the IP address of its own connection.
In the normal course of events, a client on the Internet will query either of the servers more
or less randomly (see "Round-robin DNS"
below) and wait for a reply. It will receive the
answer that the IP address to be used for the server
is the one on the same provider link as the DNS server.
In the event of a link failure, queries to the DNS
server on the failed link will not return any result,
so the client will choose to query the other DNS server on the "good" link. This will result in
the IP address of the server on the working link being returned and all requests
being routed via the single working link.
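The behaviour described above can be modelled in a few lines. This is a simplified Python sketch, not a real resolver: the provider names, name server names and addresses are hypothetical (the addresses come from the ranges reserved for documentation), and a real client would also have to wait for a timeout before trying the second server.

```python
import random

# Hypothetical set-up: one DNS server and one web server address per provider link.
LINKS = {
    "provider-a": {"dns": "ns1.example.com", "web_ip": "192.0.2.10"},
    "provider-b": {"dns": "ns2.example.com", "web_ip": "198.51.100.10"},
}

def resolve(links, link_up, rng=random):
    """Try the authoritative servers in random order and return the answer
    from the first one that is reachable, or None if every link is down."""
    order = list(links)
    rng.shuffle(order)
    for name in order:
        if link_up[name]:
            # Each DNS server only ever returns the address on its own link.
            return links[name]["web_ip"]
    return None

# Both links up: either address may be returned.
print(resolve(LINKS, {"provider-a": True, "provider-b": True}))
# Link A down: only the address on link B can be returned.
print(resolve(LINKS, {"provider-a": False, "provider-b": True}))
```

The short time to live on the answers matters here, because it bounds how long a client keeps using an address it obtained before a link failed.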
There are some drawbacks with this approach. If the client queries a DNS server, gets the
IP address answer and starts using it, and the
link over which it is making requests then fails, it
will continue to use that IP address for some time (in practice at least around 15 minutes).
During that time new sessions from other clients
will correctly "find" the working link and not
be impacted by the failure but existing sessions
will see some problems. Whilst less than perfect,
this situation is better than a total loss of the
service whilst a single provider link is down and has
a fraction of the implementation and support costs of BGP multi-homing.
Geographic load balancing
Taking the above multi-homing approach to its logical conclusion, it can be seen that there
is no requirement that both sets of Internet connectivity are in the same place. Indeed
it would add more resilience to the above scheme if two physical sets of DNS and web-servers
were in different locations, which would eliminate environmental factors (e.g. loss of power) in
one location as a failure mode, and provide built
in disaster recovery.
Unfortunately the simple architecture above has very limited ability to deal efficiently with
widely dispersed geographic locations. This is
because the web-server used depends only on which DNS server happens to get the first request
and answer it. There is some built in geographic
pre-selection in the way that some versions of the DNS software prefer servers which
answer quickly (i.e. are topologically close), but
this doesn't make much impact on the essentially random choice of initial DNS server. The
end result is that an organisation using the architecture to set up redundant server
facilities in, say, London and New York, would find
a significant proportion of its European customers using web-servers in New York
and American customers accessing web-servers in London over slow transatlantic links.
The answer to this particular problem is the
use of geographic load balancers.
These devices replace many of the components of a multi-location architecture with a
single box at each location which is able to simultaneously work out how close a
client is to its own location, and which of the
other remote locations it is peered with are currently available.
The geographic load balancer also acts as DNS server for the web-server domain.
When a new request comes in, it determines which of the remote locations is available and
is closest to the DNS client. Rather than always returning its own addresses, it then
returns the address of the closest available
peered facility. In this way, redundancy is
achieved, but clients are also likely to always be sent
to the server which has the quickest network links to them so long as it is available.
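The selection step can be sketched as follows. This is a minimal Python illustration only: the site names and addresses are hypothetical, and using a measured round-trip time as the measure of "closeness" is our assumption, since the way a given product estimates distance is not covered here.

```python
def choose_site(sites, client_rtt_ms):
    """Return the address of the available site with the lowest round-trip
    time to the client, or None if no measured site is available."""
    candidates = [
        name for name, info in sites.items()
        if info["available"] and name in client_rtt_ms
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda name: client_rtt_ms[name])
    return sites[best]["address"]

# Hypothetical facilities and a client located in Europe.
sites = {
    "london": {"address": "192.0.2.10", "available": True},
    "new-york": {"address": "198.51.100.10", "available": True},
}
rtt = {"london": 15.0, "new-york": 90.0}

print(choose_site(sites, rtt))        # London is closer and available
sites["london"]["available"] = False
print(choose_site(sites, rtt))        # falls back to New York
```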
An excellent example of a cost effective geographic load balancer is the
Envoy device from Coyote Point Systems.
Web Server Reliability
With the reliability of the Internet
connectivity taken care of, the next aspect to look at is
the reliability of the server architecture itself.
As previously discussed, simple static web-servers have few problems coping with
heavy traffic loads. When serving static web-pages, capacity and reliability can generally
be improved simply by adding additional servers which accept requests in parallel.
There are a number of mechanisms available to allow request traffic to be shared
among web-servers.
Round Robin DNS
The simplest form of sharing connections between a cluster of servers is to give
multiple IP addresses to the primary web server name.
Because of the way that the DNS operates, if the number of connections over time is
large then statistically the number of requests to each server will be approximately equal.
There are however two major problems with this approach:
- It takes no account of server load, so
a slower server will receive the same number of connections as a
higher capacity device.
- It behaves poorly under server
failure, with browser timeouts etc, because requests still go to the failed server.
This tends to limit its usefulness to only
the simplest applications.
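Both points can be seen in a small simulation. The Python sketch below is a deliberate simplification: it hands out addresses in strict rotation, whereas real DNS clients and caches only approximate this, and the addresses are hypothetical.

```python
from collections import Counter
from itertools import cycle

def round_robin_counts(addresses, request_count):
    """Hand out addresses in strict rotation and count requests per address."""
    rotation = cycle(addresses)
    return Counter(next(rotation) for _ in range(request_count))

def failed_fraction(addresses, failed, request_count):
    """Fraction of requests still sent to servers in the `failed` set."""
    counts = round_robin_counts(addresses, request_count)
    return sum(counts[a] for a in failed) / request_count

servers = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
print(round_robin_counts(servers, 3000))
print(failed_fraction(servers, {"192.0.2.2"}, 3000))
```

Every server receives the same share regardless of its capacity, and with one of three servers down roughly a third of requests are still sent to it.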
Load Balancing
The next step is to put a device in front of
the web-servers which actively brokers the connections and farms them out to
servers which are able to accept them. The load
balancer periodically places requests on all of the servers in
a cluster to determine which ones are available. In this
way, servers can be taken out for maintenance, or in case
of failure, without impacting the service.
Load balancers typically also use a range of
factors including server response time to measure which
servers are giving the best service at any point in time
and therefore place connections with the server which is
most likely to give the best response. There is usually a
damping or adaptive algorithm in place to ensure that server
load does not oscillate.
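The combination of availability checks, response-time measurement and damping can be sketched in Python as below. This is an illustration under our own assumptions, not a description of any particular product: the class and function names are invented, and an exponentially weighted moving average is used as one simple form of damping.

```python
class Backend:
    """One server behind the load balancer, with a damped response time."""

    def __init__(self, name, alpha=0.2):
        self.name = name
        self.alpha = alpha        # smoothing factor: smaller means heavier damping
        self.healthy = True
        self.smoothed_ms = None   # no measurement yet

    def record_probe(self, ok, response_ms=None):
        """Store the result of one periodic probe request."""
        self.healthy = ok
        if ok and response_ms is not None:
            if self.smoothed_ms is None:
                self.smoothed_ms = response_ms
            else:
                self.smoothed_ms = (self.alpha * response_ms
                                    + (1 - self.alpha) * self.smoothed_ms)

def pick_backend(backends):
    """Choose the healthy backend with the lowest smoothed response time."""
    live = [b for b in backends if b.healthy and b.smoothed_ms is not None]
    if not live:
        return None
    return min(live, key=lambda b: b.smoothed_ms)

web1, web2 = Backend("web1"), Backend("web2")
web1.record_probe(True, 100.0)
web2.record_probe(True, 50.0)
print(pick_backend([web1, web2]).name)   # web2 is currently faster
web2.record_probe(False)                 # web2 fails its next probe
print(pick_backend([web1, web2]).name)   # traffic moves to web1
```

Because each new measurement only contributes a fraction `alpha` of the smoothed value, a single slow response does not immediately swing all traffic to another server.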
Some systems also include the ability to have the
servers themselves tell the load balancer how loaded they are by
use of a server agent.
Load balancer issues
A minor issue with introducing a load balancer into
the front end of a web farm is that it can become a single
point of failure itself.
Whilst they are simple devices with fewer failure
modes than, say, a full web-server, if load balancers are
being deployed to improve reliability, it is best to check whether
the implementation allows you to deploy them in a
redundant configuration. By using a pair of load balancers in a
fail-over configuration, the possibility of service-impacting
failures of the load balancers is reduced.
Dynamic Content
When serving dynamic content, it can be even
more important to deploy load balancers in order to
bypass failed servers and improve performance. There are
some additional considerations in this environment.
Content Verification
One of the problems with complex web services
which involve running code on the server and
intermediate systems can be that the software as well as hardware
fails in a less than clean way. It is quite common with
certain active web servers that the server software
and/or operating system fails so that the server continues
to accept HTTP requests and makes apparently valid responses, but
the pages are empty or corrupted.
This presents a problem for some load balancers
which will continue to use the failed server and allow
corrupt pages to be served to some proportion of users via
that server.
The better load balancers include Active
Content Verification in which the load balancer
periodically places a specific request on the server and ensures
that the response is as expected as part of its checks that
the server is operational.
This can be used in conjunction with an active server
so that the load balancer makes its
request for a specific diagnostics page which checks all
aspects of the software systems on the server. If any aspect
fails then the diagnostics page should return a result
which causes the load balancer to drop the server from the active cluster.
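A content check of this kind can be sketched in Python using only the standard library. The diagnostics URL and the marker string are hypothetical; the essential point is that the check inspects the body of the response and not just whether the server answered.

```python
from urllib.error import URLError
from urllib.request import urlopen

def response_is_valid(status, body, expected_marker):
    """A response only counts as healthy if it has status 200 and its body
    contains the marker that the diagnostics page emits on success."""
    return status == 200 and expected_marker in body

def check_server(url, expected_marker, timeout=5.0):
    """Fetch the diagnostics page and verify its content.
    Any network or HTTP error counts as a failed check."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_valid(resp.status, body, expected_marker)
    except (URLError, OSError):
        return False

# An empty page served with a 200 status is still treated as a failure.
print(response_is_valid(200, "", "DIAG-OK"))                 # False
print(response_is_valid(200, "<p>DIAG-OK</p>", "DIAG-OK"))   # True
```

A call such as `check_server("http://web1.example.com/diag", "DIAG-OK")` would then be made periodically for each server in the cluster, with the host name and path here standing in for whatever diagnostics page the application provides.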
User sessions and context
So far when talking about load balancing, we have
assumed that each web request is standalone and can simply
be farmed out to whichever server is most convenient.
Whilst this model certainly works for static web pages,
it can be a problem if certain kinds of user tracking are in
use to store session information like shopping basket
contents. If the user session data is stored within the local server
then the next page load from the user may be directed by
the load balancer to a different server which has no record
of that user.
This problem can be solved in two ways, both of which
have different impacts on application architecture
and performance:
- Place the session context in a separate store
or database which is accessible to all the front-end
web servers.
- Use "sticky sessions" on the load balancer to cause
it to attempt to always direct the user back to the same
web server in the cluster.
Implemented properly, the former approach can give
the best scalability as the load balancer is free to make
optimal decisions about directing each request to the
most appropriate web-server at that time.
The latter sticky connection approach can degrade
load balancing performance because the load balancer now needs to
keep significant persistent information about user sessions,
and is constrained in the load balancing decisions it makes.
It does however have the advantage that no changes
are necessary to the application architecture where
systems are currently used which store session information
locally. This is especially important where scripting languages
like Active Server Pages on the Microsoft® IIS server have
been used. The built in Session object within ASP stores
the session context within the memory of the web-server
and therefore isn't amenable to operation on multiple
web-servers.
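The state that a load balancer has to keep for sticky sessions can be sketched in Python as below. The class name and the simple rotation used for new sessions are our own assumptions for illustration; real devices typically identify the session from a cookie or the client address.

```python
class StickyBalancer:
    """Remembers which server each session was first sent to."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.assignments = {}   # session id -> server: the state the balancer must keep
        self._next = 0          # rotation counter for placing new sessions

    def route(self, session_id, healthy):
        """Return the server for this session, reassigning only if the
        remembered server is no longer healthy."""
        server = self.assignments.get(session_id)
        if server is None or server not in healthy:
            candidates = [s for s in self.servers if s in healthy]
            if not candidates:
                return None
            server = candidates[self._next % len(candidates)]
            self._next += 1
            self.assignments[session_id] = server
        return server

lb = StickyBalancer(["web1", "web2"])
up = {"web1", "web2"}
print(lb.route("user-1", up), lb.route("user-1", up))   # same server both times
print(lb.route("user-1", {"web2"}))                     # web1 failed: session moves
```

Note that when a session is moved after a failure, any context held only in the memory of the failed server is lost, which is the main weakness of this approach compared with a shared session store.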
Back-end Systems
In our experience it is always preferable to duplicate
and load balance inexpensive components within a
server facility in order to deliver the best reliability
and performance in a cost-effective way.
There are however components in many application
architectures where machine level hardware and
software redundancy just isn't possible.
For example in a customer transaction system, there
can be any number of web-servers which are delivering
HTML by running scripting engines or middleware
applications. There can however be only one customer database
which is being referenced by all of the application servers. It
is operationally impossible to have them each
updating different copies of this database held on parallel hardware.
This is the point where traditional enterprise
data-centre principles need to be used to ensure high point
reliability on these critical components.
High Availability Hardware
A full description of the range and application of
HA systems is well beyond the scope of this document, but
as a pragmatic guide, there are a few points that it is
worth making here.
Firstly, HA systems can range from basic
hardware redundancy arrangements within a single server (RAID
disk array, multiple redundant power supplies etc), through
to incredibly complex and expensive "non-stop" clusters
of multiple CPUs, storage area networks etc. The
key characteristic they all share is that there is a degree
of hardware redundancy and "hot-swap" capability in
critical hardware components. Because of this, failure
or maintenance of those components does not
necessitate taking the system out of production.
It is crucially important when planning HA systems
to understand the degree of up-time required, but also
how this compares to other parts of the system and
software implementation.
For example it is possible to spend very large sums
on complex very high availability clusters in the
expectation of a near zero downtime probability. If it is then
necessary to take the entire system out of production several times
during its life to correct software problems or perform
upgrades, the end effect is that the difference in availability
delivered by the HA hardware is minuscule in the context of
total downtime.
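The point can be illustrated with some simple arithmetic. In the Python sketch below all of the figures are invented for illustration: a very expensive cluster and a basic high reliability server are compared, each taken out of service four times a year for four hours of software maintenance.

```python
HOURS_PER_YEAR = 365 * 24

def annual_downtime_hours(hardware_availability, planned_outages, hours_per_outage):
    """Total yearly downtime: unplanned hardware failure plus planned outages."""
    unplanned = HOURS_PER_YEAR * (1 - hardware_availability)
    planned = planned_outages * hours_per_outage
    return unplanned + planned

# Hypothetical figures for illustration only.
cluster = annual_downtime_hours(0.99999, planned_outages=4, hours_per_outage=4.0)
basic = annual_downtime_hours(0.9999, planned_outages=4, hours_per_outage=4.0)
print(f"non-stop cluster: {cluster:.2f} hours per year")
print(f"basic HA server:  {basic:.2f} hours per year")
```

With these assumed figures both systems are down for roughly 16 to 17 hours a year, and the planned software outages account for almost all of it.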
A more cost effective solution may have been to
deploy basic high reliability systems which pre-empt most of
the failure modes and then put more focus into
redundancy of software systems.
One configuration which we have found works very
well over time, particularly for database systems, is to
deploy basic high reliability servers which have redundant
hot-swap hardware covering most common failure modes.
This means RAID1 mirrored disk arrays, multiple power
supplies etc. Then use software redundancy techniques to
maintain hot-stand-by systems (e.g. database replication). This
often gives a very high total availability of the system
function without astronomic hardware costs.
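The benefit of a hot stand-by can be estimated in the same way. The following Python sketch assumes that the primary and stand-by fail independently and treats the chance of a successful fail-over as a single probability; both are simplifying assumptions, and the figures are illustrative.

```python
def standby_pair_availability(single, failover_success=1.0):
    """Availability of a primary with a hot stand-by. The function is up if
    the primary is up, or if the primary is down, the fail-over works and
    the stand-by is up."""
    return single + (1 - single) * failover_success * single

print(f"{standby_pair_availability(0.999):.6f}")        # perfect fail-over
print(f"{standby_pair_availability(0.999, 0.9):.6f}")   # fail-over works 9 times in 10
```

Even with an imperfect fail-over, two modest servers kept in step by replication can under these assumptions exceed the availability of a single, much more expensive machine.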
Processes and Procedures
No analysis of server reliability issues is complete
without a discussion of the pivotal role that the quality of
the system administration processes plays.
This is important in all parts of the server facility, but
is especially important in areas like HA systems where
there are no parallel servers to take the load if a
system administration procedure inadvertently removes
a component from service.
It is crucially important to ensure that all staff who
may be required to perform administration procedures
are well trained on the detail of both periodic
maintenance and diagnosis and rectification of problems that they
may encounter.
It is also important that local and vendor
procedure documentation for maintenance operations is
both available at all times and regularly updated,
and that the procedures are practised from time to time in a non-operationally
critical environment.
Whilst the above may seem completely obvious it is
the neglect particularly of the latter principles which
most often actually causes downtime.
Any analysis of service failures will very often come
across many instances where a simple non-service-impacting
failure was then followed up by an operator
intervention which, through lack of experience, documentation, or
an unpredicted side effect, actually caused an outage.
A recent example which actually affected one of
our customers was an incident where a disk failed in a
RAID1 redundant array on a critical database server. A
hot-spare disk was immediately re-mapped to replace it and
the server continued to run in the meantime whilst
re-mirroring restored its redundancy. The failed disk was returned
to the manufacturer who sent a replacement a few
days later.
The replacement disk was inserted into the server
cabinet, but for safety the very experienced operator decided
to tell the RAID controller to perform a "surface scan" of
the disk before assigning it back to a RAID array.
Unfortunately an undocumented side effect of the "surface scan" option on the particular software
revision of the RAID controller was that it locked out the
controller from the host until the operation completed and was
not interruptible. The production service was now down!
There are a number of principles exposed by the
above example:
Keep operations on production systems to an
absolute minimum
The surface scan, if required at all, shouldn't have
been performed on a live system, but back in the lab on a
test system.
Don't perform operations on a live system unless you
are completely sure of the consequences and are following
a documented procedure.
If the surface scan operation had been tested under lab conditions with an identical
system, it would have become apparent that a
damaging side effect existed.
Physical Location
Reliability factors associated with a physical location fall roughly speaking into
three categories:
- Reliability of power supplies
- Environmental factors (excess heat, water, dust ingress etc)
- Physical damage risks to equipment (including communications links)
Physical location factors are often the most complicated and expensive to address.
Except for simple improvements like UPS, or taking physical factors into consideration
when placing a server room in an existing
building, it is likely that most organisations will
either use secure co-location facilities or seek extensive professional advice as part
of planning organisation wide computing facilities.
The following information is therefore tailored to a simple overview which is of use
when determining suitability of existing facilities
or quality of co-location space.
Power Supplies
Due to the possibility of interruption in
utility electricity supply, it is usual to
incorporate some degree of local power supply into
a server facility.
Often this takes the form of an uninterruptible power supply (UPS) which provides cover
for short supply interruptions (measured in minutes). It can also be used to allow
an orderly shutdown of services to avoid data loss in the event of a longer outage, but
isn't normally capable of providing protection against significant outages in utility power.
Local generators are the only way that longer outages can be effectively countered.
Consideration should be given to the number of days of local fuel supply carried at
the location and the strength of priority
re-supply contracts in the event of local disasters.
Environmental Factors
Excess heat, water, and foreign matter ingress (dust etc) are all factors that can
cause catastrophic failure of a server facility.
Dealing with thermal hazards involves ensuring that sufficient air conditioning capacity
is present to cope with the heat dissipation of the equipment, and thermal gains from
the outside environment in the most extreme conditions envisaged. It is also important
to ensure there is redundant capacity in the air conditioning systems to cope with
the failure of parts of the system without the building temperature rising outside
an acceptable band.
It is essential in any facility which is designed to run for more than a few
minutes in the absence of utility power that there
is sufficient air-conditioning capacity connected to the emergency power
supply. If not then rising temperatures will
result in equipment damage or shutdown after a surprisingly short period of
running (typically less than 30 minutes).
Water risks fall into two categories:
- External water ingress (leaking
flat roof or exterior walls, flooding into ground floor or basement facilities)
- Local flooding from building water or waste water installations.
Whilst the former is obviously part of the building design and maintenance,
often particularly in basements there is a
reliance on active equipment (sump-pumps) as a primary or secondary mechanism
for preventing water ingress. If these devices are in use then it is important that
they are provided with emergency power supplies to ensure their continued operation
during utility power failure (which often occurs during storms!).
Wherever possible equipment should be located so that it is not directly under
any water bearing building services (including air-conditioning plant), and that
water which finds its way into floor or ceiling voids cannot drain into equipment,
power or network cabling.
Physical Damage and Security
Security issues affecting service provision are a topic which is worthy of
separate treatment in its own right and will be
the subject of a later RPANetwork guide in
this series.