Web Servers

Herng-Yow Chen
1
Outline



Survey many different types of software
and hardware web servers.
Describe how to write a simple diagnostic
web server in Perl.
Explain how web servers process HTTP
transactions, step by step.
2
Different types of web servers



General-purpose software web servers
Web server appliances
Embedded web servers
3
Jobs of web servers


Implement HTTP and the related TCP
connection handling.
Manage the server-side resources and
provide administrative features to configure,
control, and enhance the web service.
4
Jobs of the operating system




Manage the hardware details of the
underlying computer system.
Provide TCP/IP network support.
Provide filesystems to hold web resources.
Provide process management to control
computing activities.
5
General-purpose software web
servers




General-purpose software web servers run
on standard, network-enabled computer
systems.
Open source software (such as Apache or
W3C’s Jigsaw).
Commercial software (such as Microsoft’s
and iPlanet’s web servers).
Web server software is available for just
about every computer and operating
system.
6
General-Purpose Software Web Servers
From the September 2004 Netcraft survey (http://news.netcraft.com/archives/web_server_survey.html)
7
Web server appliances

Web server appliances are prepackaged
software/hardware solutions. The vendor preinstalls a
software server onto a vendor-chosen computer platform
and preconfigures the software.




Sun/Cobalt RaQ web appliance
(http://www.cobalt.com)
Toshiba Magnia SG10 (http://www.toshiba.com)
IBM Whistle web server appliance (http://www.whistle.com)
Appliance solutions remove the need to install and
configure software and often greatly simplify
administration. However, the web server is often less
flexible and less feature-rich, and the server hardware is not
easily upgradable.
8
Embedded web servers


Embedded servers are tiny web servers intended
to be embedded into consumer products (e.g.,
printers or home appliances).
They allow users to administer their consumer devices
using a convenient web browser interface.
iPic match-head-sized web server
(http://www-ccs.cs.umass.edu/~shri/iPic.html)
NetMedia SitePlayer SP1 Ethernet web server
(http://www.siteplayer.com)
9
A Minimal Perl Web server


Type-o-serve – a minimal Perl web server
used for HTTP debugging
http://www.http-guide.com/tools/type-o-serve.pl
10
A Minimal Perl Web Server
HTTP request message:
GET /blah.txt HTTP/1.1
Accept: */*
Accept-language: en-us
Accept-encoding: gzip, deflate
User-agent: Mozilla/4.0
Host: www.csie.ncnu.edu.tw:8080
Connection: Keep-alive

Type-o-serve dialog:
% ./type-o-serve.pl 8080
<<Request From 'www.csie.ncnu.edu.tw'>>
GET /blah.txt HTTP/1.1
Accept: */*
Accept-language: en-us
Accept-encoding: gzip, deflate
User-agent: Mozilla/4.0
Host: www.csie.ncnu.edu.tw:8080
Connection: Keep-alive

<<Type Response followed by '.'>>
HTTP/1.0 200 OK
Connection: close
Content-type: text/plain

Hi there!

HTTP response message:
HTTP/1.0 200 OK
Connection: close
Content-type: text/plain

Hi there!
11
What do web servers do?
1. Set up connection
2. Receive request
3. Process request
4. Access resource
5. Construct response
6. Send response
7. Log transaction
12
What Real Web Servers Do
[Figure: the HTTP server software process runs in user space, on top of the operating system's TCP/IP network stack, network interface, and object storage. For each client it performs: (1) set up connection, (2) receive request, (3) process request, (4) access resource, (5) create response, (6) send response, (7) log transaction.]
13
Step 1: accepting client connections

Handling new connections
  Extracting the client IP from a new TCP connection
Client hostname identification
  Using “reverse DNS”
Determining the client user through ident
  Some web servers support the IETF ident protocol
14
Handling new connections



When a client requests a TCP connection to the
web server, the web server establishes the
connection and determines which client is on the
other side of the connection, extracting the IP
address from the TCP connection (e.g., using the
getpeername call on UNIX sockets).
The server is free to reject and immediately close
a connection, for example because the client IP is
unauthorized or belongs to a known malicious client.
Once a new connection is established and
accepted, the server adds the new connection to
its list of existing connections and prepares to
watch for data on the connection.
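The accept-or-reject decision described above can be sketched in Python. This is a minimal illustration, not how any particular server does it: the `BLOCKED_IPS` deny list and the function names are hypothetical, and a real server would also handle errors and timeouts.

```python
import socket

# Hypothetical deny list; 203.0.113.7 is a documentation-only address.
BLOCKED_IPS = {"203.0.113.7"}

def is_allowed(client_ip, blocked=BLOCKED_IPS):
    """Return True if the server should keep a connection from client_ip."""
    return client_ip not in blocked

def accept_one(listener, connections, blocked=BLOCKED_IPS):
    """Accept one connection; close it if blocked, otherwise track it."""
    conn, _addr = listener.accept()
    # getpeername() reports the address of the other end of the connection.
    client_ip = conn.getpeername()[0]
    if not is_allowed(client_ip, blocked):
        conn.close()  # reject and immediately close
        return None
    connections.append(conn)  # add to the list of existing connections
    return conn
```

Here `listener` is assumed to be a bound, listening `socket.socket`, and `connections` is the server's own list of open connections.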
15
Client host identification




Most web servers can be configured to convert client IP
addresses into client hostnames, using “reverse DNS.”
The hostname information is used for detailed access
control and logging.
Note that hostname lookups can take a long time, slowing
down web transactions. Many high-performance web
servers either disable hostname resolution or enable it
only for particular content.
Ex: Configuring Apache to look up hostnames only for HTML
and CGI resources
HostnameLookups off
<Files ~ "\.(html|htm|cgi)$">
  HostnameLookups on
</Files>
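As a rough sketch of what a reverse DNS lookup involves, the Python function below converts a client IP address to a hostname and falls back to the IP itself when no name is available. The function name `client_hostname` is an assumption for illustration; note that the lookup blocks until DNS answers, which is the delay the slide warns about.

```python
import socket

def client_hostname(ip):
    """Return the reverse-DNS hostname for ip, or ip itself if lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return ip
```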
16
Determining the client user through
ident



The ident protocol lets servers find out what
username initiated an HTTP connection.
The username information is particularly useful
for logging: the second field of the popular
Common Log Format contains the ident
username of each HTTP request. (Ident was first
specified in RFC 931; the updated specification
is RFC 1413.)
If a client supports the ident protocol, the client
listens on TCP port 113 for ident requests.
17
Determining the Client User
Through ident
[Figure: (a) Mary establishes a new HTTP connection from her port 4236 to port 80 on the web server. (b) The server establishes an ident connection to port 113 on Mary's machine. (c) The server sends the request "4236, 80". (d) The client returns the ident response "4236, 80:USERID:UNIX:MARY".]
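The ident exchange in the figure can be sketched as two small Python helpers: one builds the query line the web server sends to port 113, the other extracts the username from a USERID reply. The helper names are hypothetical, and stripping whitespace around every field is a simplification made here, not a statement of what RFC 1413 requires.

```python
def build_ident_query(client_port, server_port):
    """Build the query line: the client's port, then the web server's port."""
    return f"{client_port}, {server_port}\r\n"

def parse_ident_response(line):
    """Return the username from a USERID reply, or None for any other reply."""
    # Expected shape: "<ports> : USERID : <os> : <username>"
    parts = [part.strip() for part in line.strip().split(":", 3)]
    if len(parts) == 4 and parts[1].upper() == "USERID":
        return parts[3]
    return None  # ERROR replies and malformed lines carry no username
```

With the values from the figure, `build_ident_query(4236, 80)` produces the request line and `parse_ident_response("4236, 80:USERID:UNIX:MARY")` yields `"MARY"`.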
18
Ident protocol (cont.)

Ident can work inside organizations, but it does
not work well across the public Internet, for the
following reasons:
Many client PCs don’t run the identd Identification Protocol
daemon software.
The ident protocol significantly delays HTTP transactions.
Many firewalls won’t permit incoming ident traffic.
The ident protocol is insecure and easy to fabricate.
The ident protocol doesn’t support virtual IP addresses well.
There are privacy concerns about exposing client usernames.
Enable ident lookups in Apache:
IdentityCheck on
Common Log Format log files typically contain hyphens (-) in the
second field if no ident information is available.
19
Step 2: Receiving request messages

As data arrives on connections, the server
reads the data and starts parsing the request
message:
Parses the request line, looking for the request method,
the specified URI, and the version number.
Reads the message headers, each ending in CRLF.
Detects the end-of-headers blank line, ending in CRLF.
Reads the request body, if any (length specified by the
Content-Length header).
Internal representations of messages
Some web servers also store the request message in
internal data structures that make the message easy
to manipulate.
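The parsing steps above can be sketched in Python as follows. This is a simplified illustration that assumes the whole request is already in memory; a real server reads incrementally from the network and must cope with malformed input. The function name and the returned tuple layout are choices made for this example.

```python
def parse_request(raw):
    """Parse raw request bytes into (method, uri, version, headers, body)."""
    # The blank line (CRLF CRLF) separates the headers from the body.
    head, _sep, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: method, URI, and version separated by spaces.
    method, uri, version = lines[0].split(" ", 2)

    # Header lines: "Name: value"; names are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # The body length, if any, comes from Content-Length.
    length = int(headers.get("content-length", 0))
    return method, uri, version, headers, rest[:length]
```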
20
Receiving Request Messages
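[Figure: a request message being read from the network as it travels from client to server. The server has so far received "GET /specials/hychen.gif HTTP/1.0 CRLF", "Accept: image/gif CRLF", and the start of the Host header; the rest of "Host: www.joes-hardware.com CRLF CRLF" is still in transit.]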
21
Internal Representations of Messages
GET /specials/saw-blade.gif HTTP/1.0CRLF
Accept: image/gifCRLF
Host: www.joes-hardware.comCRLF
CRLF
is parsed into:
method: 1
version: 1.0
uri: /specials/saw-blade.gif
header count: 2
headers:
  Name: Host, Value: www.joes-hardware.com
  Name: Accept, Value: image/gif
body: -
22
Different web server architectures



Single-threaded web servers
Multi-process and multi-threaded web servers
Multiplexed I/O web servers
  Non-blocking network access
Multiplexed multi-threaded web servers
23
Connection Input/Output Processing
Architectures
24
Step 3: Processing requests

Once the web server has received a
request, it can process the request using the
method, resource, headers, and optional
body.

Some methods (e.g., POST) require entity
body data in the request message. A few
methods (e.g., GET) forbid entity body data
in the request message.
25
Step 4: Mapping and Accessing
resources







Docroot
Virtually hosted docroots
User home directory docroots
Directory Listings
Dynamic content resource mapping
Server-Side Includes (SSI)
Access Control
26
Docroots

Web servers support different kinds of resource mapping,
but the simplest form of mapping uses the request URI to
name a file in the web server’s filesystem.

Typically, a special folder in the web server filesystem is
reserved for web content. The folder is called the
document root, or docroot.

The web server takes the URI from the request message
and appends it to the document root. In Apache, the docroot
is set with the DocumentRoot directive:
DocumentRoot /usr/local/httpd/files
Servers must be careful not to let relative URLs back up
out of a document root and expose other parts of the
filesystem. For example, a request for
http://www.csie.ncnu.edu.tw/../ must not resolve to the
directory above the docroot.
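A minimal Python sketch of docroot mapping with a check against backing out of the docroot is shown below. The function name is hypothetical, and the sketch deliberately ignores percent-decoding and symbolic links, both of which a real server has to consider.

```python
import posixpath

DOCROOT = "/usr/local/httpd/files"

def map_uri_to_path(uri_path, docroot=DOCROOT):
    """Append uri_path to docroot; return None if the result escapes docroot."""
    joined = posixpath.join(docroot, uri_path.lstrip("/"))
    # normpath collapses "." and ".." segments so the check sees the real target.
    full = posixpath.normpath(joined)
    if full != docroot and not full.startswith(docroot + "/"):
        return None
    return full
```

For example, `/specials/hychen.gif` maps to `/usr/local/httpd/files/specials/hychen.gif`, while `/../` is refused.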
27
Docroots
[Figure: the client sends the request message "GET /specials/hychen.gif HTTP/1.0" with "Host: www.csie.ncnu.edu.tw". The web server appends the request URI /specials/hychen.gif to the docroot /usr/local/httpd/files, giving the server resource /usr/local/httpd/files/specials/hychen.gif in object storage.]
28
Virtually hosted docroots

Virtually hosted web servers host multiple
web sites on the same web server, giving
each site its own distinct document root on
the server.

A virtually hosted web server identifies the
correct document root to use from the IP address or
the hostname in the Host header.
29
Apache’s virtual host configuration

<VirtualHost www.joes-hardware.com>
  ServerName www.joes-hardware.com
  DocumentRoot /docs/joe
  TransferLog /logs/joe.access_log
  ErrorLog /logs/joe.error_log
</VirtualHost>

<VirtualHost www.marys-antiques.com>
  ServerName www.marys-antiques.com
  DocumentRoot /docs/mary
  TransferLog /logs/mary.access_log
  ErrorLog /logs/mary.error_log
</VirtualHost>
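The Host-header lookup behind virtual hosting can be sketched in Python as a simple table lookup. The table uses the two sites and docroots from this section; the `DEFAULT_DOCROOT` fallback and the function name are assumptions added for illustration.

```python
VIRTUAL_HOSTS = {
    "www.joes-hardware.com": "/docs/joe",
    "www.marys-antiques.com": "/docs/mary",
}
DEFAULT_DOCROOT = "/docs/default"  # hypothetical fallback site

def docroot_for(headers):
    """Pick the docroot from the Host header (hostnames are case-insensitive)."""
    host = headers.get("host", "")
    # Drop an optional ":port" suffix before looking up the site.
    hostname = host.split(":")[0].strip().lower()
    return VIRTUAL_HOSTS.get(hostname, DEFAULT_DOCROOT)
```

Here `headers` is assumed to be a dict of lowercased header names, so the same `GET /index.html` request resolves under `/docs/joe` or `/docs/mary` depending only on its Host header.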
30
Virtually hosted docroots
[Figure: request message A ("GET /index.html HTTP/1.0", "Host: www.joes-hardware.com") is served from /docs/joe; request message B ("GET /index.html HTTP/1.0", "Host: www.marys-antiques.com") is served from /docs/mary. Both sites, www.joes-hardware.com and www.marys-antiques.com, are hosted on the same server.]
31
User home directory docroots
[Figure: request message A ("GET /~bob/index.html HTTP/1.0") is served from /home/bob/public_html; request message B ("GET /~betty/index.html HTTP/1.0") is served from /home/betty/public_html.]
32
User home directory docroots




Another common use of docroots gives people private
web sites on a web server.
A typical convention maps URIs whose paths begin with a
slash and tilde (/~) followed by a username to a private
document root for that user.
The private docroot is often the folder called public_html
inside that user’s home directory, but it can be configured
differently (e.g., on the NCNU web server, we use WWW
as the user’s private document root).
In Apache’s configuration:
UserDir public_html
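The /~username convention can be sketched in Python as below. The `/home` prefix follows the figure for this section; the function name, the `home_root` parameter, and the restriction to alphanumeric usernames are simplifying assumptions, since real servers look up the home directory from the system's user database.

```python
import posixpath

def map_user_uri(uri_path, user_dir="public_html", home_root="/home"):
    """Map /~user/rest to <home_root>/user/<user_dir>/rest, or return None."""
    if not uri_path.startswith("/~"):
        return None
    user, _slash, rest = uri_path[2:].partition("/")
    if not user.isalnum():  # simplification: reject empty or unusual names
        return None
    root = posixpath.join(home_root, user, user_dir)
    full = posixpath.normpath(posixpath.join(root, rest))
    # Refuse paths that back out of the user's private docroot.
    if full != root and not full.startswith(root + "/"):
        return None
    return full
```

Passing `user_dir="WWW"` models the alternative private docroot name mentioned above.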
33
Directory listings

A web server can receive requests for directory
URLs, where the path resolves to a directory, not
a file.

Most web servers can be configured to take a
few different actions when a client requests a
directory URL:
Return an error.
Return a special, default, “index file” instead of the
directory.
Scan the directory, and return an HTML page
containing the contents.
34
Directory Listings (continued)

Most web servers look for a file named
index.html or index.htm inside a directory to
represent that directory.

In Apache’s configuration:
DirectoryIndex index.html index.htm home.html home.htm index.cgi
Disable the automatic generation of directory
index files with the Apache directive:
Options -Indexes
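The three possible actions for a directory URL can be sketched in Python as one function. The return convention (`"file"`, `"listing"`, or `"error"`), the index file names, and the choice of 403 as the error are assumptions made for this example rather than the behavior of any specific server.

```python
import html
import os
from urllib.parse import quote

INDEX_FILES = ("index.html", "index.htm")

def resolve_directory(dir_path, allow_listing=True):
    """Return ('file', path), ('listing', html_text), or ('error', 403)."""
    # 1. Prefer a default index file if the directory has one.
    for name in INDEX_FILES:
        candidate = os.path.join(dir_path, name)
        if os.path.isfile(candidate):
            return "file", candidate
    # 2. With listings disabled, return an error instead.
    if not allow_listing:
        return "error", 403
    # 3. Otherwise scan the directory and build an HTML page of its contents.
    items = "".join(
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in sorted(os.listdir(dir_path))
    )
    return "listing", f"<html><body><ul>{items}</ul></body></html>"
```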
35
Dynamic content resource mapping

Web servers can also map URIs to dynamic
resources, that is, to programs that generate
content on demand.

In fact, a whole class of web servers called
application servers connect web servers to
sophisticated backend applications.

The web server needs to be able to tell when a
resource is a dynamic resource, where the
dynamic content generator program is located,
and how to run the program.
36
Dynamic content …

In Apache’s configuration:
ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-programs/
AddHandler cgi-script .cgi
CGI is an early, simple, and popular interface for
executing server-side applications. Modern
application servers have more powerful
server-side dynamic content support, including
Active Server Pages, Java servlets, and PHP.
37
Dynamic Content Resource Mapping
[Figure: client, Internet, server]
38
Server-Side Includes (SSI)




Many web servers also provide support for
server-side includes.
If a resource is flagged as containing server-side
includes, the server processes the resource
contents before sending them to the client.
The contents are scanned for certain special
patterns, which can be variable names or
embedded scripts. The special patterns are
replaced with the values of variables or the
output of executable scripts.
This is an easy way to create dynamic content.
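The variable-substitution case can be sketched in Python with a regular expression. The `<!--#echo var="NAME" -->` pattern follows Apache's SSI syntax; the function name and the `(none)` placeholder for unknown variables are choices made for this example, and embedded scripts are not handled.

```python
import re

# Matches Apache-style echo directives such as <!--#echo var="USER" -->.
SSI_ECHO = re.compile(r'<!--#echo var="([A-Za-z_][A-Za-z0-9_]*)"\s*-->')

def expand_ssi(text, variables):
    """Replace each echo directive with the value of the named variable."""
    def substitute(match):
        return str(variables.get(match.group(1), "(none)"))
    return SSI_ECHO.sub(substitute, text)
```

The server would run this over a flagged resource just before sending it, for example `expand_ssi(page_text, {"USER": "Mary"})`.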
39
Access controls

Web servers can also assign access controls to
particular resources.

When a request arrives for an access-controlled
resource, the web server can control access
based on the IP address of the client, or it can
issue a password challenge to get access to the
resource.

We will see more details in a later lecture
(Chapter 12, HTTP authentication).
40
Step 5: Building Responses




Once the web server has identified the
resource, it performs the action described
in the request method and returns the
response message, which contains a status
code, response headers, and a response
body.
Response Entities
MIME Typing
Redirection
41
Response entities

If the transaction generated a response
body, the content is sent back with the
response message, which usually contains:
A Content-Type header, describing the MIME type of the body
A Content-Length header, describing the body size
The actual message body content
42
MIME typing


The web server is responsible for determining the
MIME type of the response body.
There are many ways to configure servers to
associate MIME types with resources:
mime.types: extension-based type association.
Magic typing: content-based association, scanning the file
contents for known patterns.
Explicit typing: force particular files or directory contents to
have a given MIME type, regardless of the file extension or contents.
Type negotiation: the server is configured to store a resource in
multiple document formats. In a client-server negotiation process,
the server can determine the “best” format to use (Chapter 17).
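Two of these approaches, explicit typing and extension-based typing, can be sketched in Python with the standard `mimetypes` module. The `EXPLICIT_TYPES` table, its sample entry, and the function name are hypothetical; magic typing and type negotiation are not shown.

```python
import mimetypes

# Hypothetical explicit overrides: these paths get a fixed type
# regardless of their extension.
EXPLICIT_TYPES = {"/downloads/report": "application/pdf"}

def content_type_for(path, default="application/octet-stream"):
    """Return the MIME type for path: explicit override, then extension."""
    if path in EXPLICIT_TYPES:
        return EXPLICIT_TYPES[path]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default
```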
43
MIME Typing
[Figure: the HTTP request message contains the command and the URI ("GET /specials/hychen.gif HTTP/1.1", "Host: www.csie.ncnu.edu.tw"). The server www.csie.ncnu.edu.tw finds the hychen.gif file and replies to the client with "HTTP/1.1 200 OK", "Content-type: image/gif", "Content-length: 8572".]
44
Redirection

Web servers sometimes return redirection
responses (indicated by a 3XX return code)
instead of success messages. The Location
response header contains a URI for the new or
preferred location of the content. Redirections
are useful for:






Permanently moved resources
Temporarily moved resources
URL augmentation
Load balancing
Server affinity
Canonicalizing directory names
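As one concrete case, canonicalizing a directory name (adding the missing trailing slash) can be sketched in Python as below. The function names, the choice of status 301, and the plain-text body are assumptions for illustration; servers differ in which 3XX code they use.

```python
def redirect_response(location, status=301, reason="Moved Permanently"):
    """Build a raw redirection response whose Location header gives the new URI."""
    body = f"The resource has moved to {location}\r\n".encode("ascii")
    head = (
        f"HTTP/1.1 {status} {reason}\r\n"
        f"Location: {location}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body

def canonicalize_directory(uri_path, is_directory):
    """Redirect a directory URI that lacks its trailing slash; else None."""
    if is_directory and not uri_path.endswith("/"):
        return redirect_response(uri_path + "/")
    return None
```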
45
300-399: Redirection Status Code

Status code    Reason phrase
300            Multiple Choices
301            Moved Permanently
302            Found
303            See Other
304            Not Modified
305            Use Proxy
306            (Unused)
307            Temporary Redirect
46
Step 6: Sending Responses




The server may have many connections to many clients,
some idle, some sending data to the server, and some
carrying response data back to the clients.
The server needs to keep track of connection state and
handle persistent connections with special care.
For non-persistent connections, the server is expected to
close its side of the connection when the entire message is
sent.
For persistent connections, the connection may stay open,
in which case the server needs to be extra cautious to
compute the Content-Length header correctly, or the
client will have no way of knowing when a response ends
(cf. Chapter 4).
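The Content-Length requirement can be sketched in Python: the header is computed from the byte length of the body, not the character count, so the client knows exactly where the response ends on a connection that stays open. The function name and its parameters are assumptions for this example.

```python
def build_response(body, content_type="text/plain", keep_alive=True):
    """Build a 200 response; body must already be encoded as bytes."""
    headers = [
        "HTTP/1.1 200 OK",
        f"Content-Type: {content_type}",
        # len() of a bytes object counts bytes, which is what the client needs.
        f"Content-Length: {len(body)}",
        f"Connection: {'keep-alive' if keep_alive else 'close'}",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + body
```

When `keep_alive` is false, the caller is expected to close its side of the connection after sending the returned bytes.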
47
Step 7: Logging

Finally, when a transaction is complete, the
web server writes an entry to a log file,
describing the transaction performed.

Most web servers provide several
configurable forms of logging. (See later
lectures, Chapter 21, for details.)
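As a rough sketch of one such format, the Python function below builds a Common Log Format entry, with hyphens standing in for missing ident and user fields. The function name and parameters are hypothetical, and the month abbreviation produced by `%b` depends on the process locale.

```python
from datetime import datetime, timezone

def common_log_line(host, request_line, status, size,
                    ident=None, user=None, when=None):
    """Format one Common Log Format entry for a completed transaction."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    # Fields: host, ident username, authenticated user, time, request,
    # status code, and response size in bytes.
    return (f'{host} {ident or "-"} {user or "-"} [{stamp}] '
            f'"{request_line}" {status} {size}')
```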
48
Reference: Web server

http://www.apache.org
The Apache web site
http://www.w3c.org/Jigsaw
Jigsaw, W3C’s server
http://www.ietf.org/rfc/rfc1413.txt
RFC 1413, “Identification Protocol,” by M. St. Johns.
49