Internet Protocol (IP)

Internet
The Internet is often described as an information superhighway for accessing
information over the web. It can also be defined in several other ways:

The Internet is a worldwide system of interconnected computer networks.

The Internet uses the standard Internet protocol suite (TCP/IP).

Every computer on the Internet is identified by a unique IP address.

An IP address is a unique set of numbers (such as 110.22.33.114) that identifies a
computer's location on the network.

The Domain Name System (DNS) associates names with IP addresses so that a user
can locate a computer by name.

For example, a DNS server will resolve the name www.tutorialspoint.com to a
particular IP address to uniquely identify the computer on which this website is
hosted.

The Internet is accessible to users all over the world.
Evolution
The concept of the Internet originated in 1969 and has undergone several
technological and infrastructural changes, as discussed below:

The Internet grew out of the Advanced Research Projects Agency Network
(ARPANET).

ARPANET was developed by the United States Department of Defense.

The basic purpose of ARPANET was to provide communication among the various
bodies of government.

Initially, there were only four nodes, formally called hosts.

In 1972, ARPANET spread across the globe with 23 nodes located in different
countries, and thus became known as the Internet.

Over time, with the invention of new technologies such as the TCP/IP protocols, DNS,
WWW, browsers and scripting languages, the Internet provided a medium to publish
and access information over the web.
Advantages
The Internet touches almost every aspect of life. Here, we will
discuss some of its advantages:

The Internet allows us to communicate with people in remote locations.
There are various apps available on the web that use the Internet as a medium for
communication. One can find various social networking sites such as:
o Facebook
o Twitter
o Yahoo
o Google+
o Flickr
o Orkut

One can search for any kind of information over the internet. Information on
various topics such as technology, health and science, social studies, geography,
information technology and products can be found with the help of a
search engine.

Apart from communication and information, the internet also serves as a
medium for entertainment. The following are various modes of entertainment
over the internet:
o Online Television
o Online Games
o Songs
o Videos
o Social Networking Apps

Internet allows us to use many services like:
o Internet Banking
o Matrimonial Services
o Online Shopping
o Online Ticket Booking
o Online Bill Payment
o Data Sharing
o E-mail

The Internet enables electronic commerce, which allows business
deals to be conducted on electronic systems.
Disadvantages
Although the Internet has proved to be a powerful source of information in
almost every field, it also has many disadvantages, discussed below:

There is always a chance of losing personal information such as your name, address
or credit card number. Therefore, one should be very careful while sharing such
information, and should use credit cards only on authenticated sites.

Another disadvantage is spamming. Spamming refers to unwanted
e-mails sent in bulk. These e-mails serve no purpose and can clog the entire
system.

Viruses can easily spread to computers connected to the internet. Such virus
attacks may cause your system to crash or your important data to be deleted.

Another major threat on the internet is pornography. Many pornographic
sites can be found, and letting children use the internet unsupervised may
affect their healthy mental development.

Various websites do not provide authentic information. This
leads to misconceptions among many people.

Packet Switching

The shortcomings of message switching gave rise to the idea of packet switching. The entire
message is broken down into smaller chunks called packets. The switching information is
added to the header of each packet, and each packet is transmitted independently.

It is easier for intermediate networking devices to store small packets, and they do not
consume many resources either on the carrier path or in the internal memory of switches.

Packet switching enhances line efficiency because packets from multiple applications can be
multiplexed over the carrier. The internet uses the packet switching technique. Packet
switching also makes it possible to differentiate data streams based on priority. Packets are
stored and forwarded according to their priority to provide quality of service.
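The idea of breaking a message into independently transmitted packets can be sketched in a few lines of Python. This is only an illustration: the Packet class, its field names and the 8-byte chunk size are assumptions made for the example, not part of any real protocol.

```python
from dataclasses import dataclass
import random


@dataclass
class Packet:
    seq: int        # header: position of this chunk in the original message
    total: int      # header: number of packets in the whole message
    payload: bytes  # the chunk of data carried by this packet


def packetize(message: bytes, size: int) -> list[Packet]:
    """Break a message into packets of at most `size` payload bytes."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [Packet(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(packets: list[Packet]) -> bytes:
    """Put packets back in order using the sequence number in each header."""
    ordered = sorted(packets, key=lambda packet: packet.seq)
    return b"".join(packet.payload for packet in ordered)


message = b"Packets travel independently and may arrive out of order."
packets = packetize(message, size=8)
random.shuffle(packets)           # simulate out-of-order arrival
restored = reassemble(packets)
```

Because every packet carries its own header, the receiver can rebuild the message regardless of the order in which the packets arrive.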
Internet Domain Name System
Overview
Before DNS existed, one had to download a hosts file containing host names
and their corresponding IP addresses. But as the number of hosts on the
internet increased, the size of the hosts file also increased. This resulted in
increased traffic from downloading the file. To solve this problem, the DNS
system was introduced.
The Domain Name System helps to resolve a host name to an address. It
uses a hierarchical naming scheme and a distributed database of IP addresses
and associated names.
IP Address
An IP address is a unique logical address assigned to a machine on the
network. An IP address has the following properties:

An IP address is the unique address assigned to each host present on the Internet.

An IP (IPv4) address is 32 bits (4 bytes) long.

An IP address consists of two components: a network component and a host
component.

Each of the 4 bytes is represented by a number from 0 to 255, separated by
dots. For example, 137.170.4.124.
An IP address is a 32-bit number, whereas domain names are easy-to-remember
names. For example, when we enter an email address we always enter a
symbolic string such as [email protected].
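The relationship between the dotted form and the underlying 32-bit number can be explored with Python's standard ipaddress module. The sketch below uses the sample address from the text; the /16 network used to separate the network and host components is purely an assumption for illustration, since the text does not say how this address is divided.

```python
import ipaddress

address = ipaddress.IPv4Address("137.170.4.124")

# The four dotted-decimal bytes, each in the range 0..255.
octets = list(address.packed)

# The same address viewed as a single 32-bit integer.
as_int = int(address)

# Assumed split: treat the first 16 bits as the network component.
network = ipaddress.ip_network("137.170.0.0/16")
host_part = as_int & int(network.hostmask)   # the remaining 16 host bits
```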
Uniform Resource Locator (URL)
A Uniform Resource Locator (URL) is a web address that uniquely
identifies a document on the internet.
This document can be a web page, image, audio, video or anything else present on the
web.
For example, www.tutorialspoint.com/internet_technology/index.html is
a URL to the index.html file, which is stored on the tutorialspoint web server under
the internet_technology directory.
URL Types
There are two forms of URL as listed below:
1. Absolute URL
2. Relative URL
ABSOLUTE URL
An absolute URL is the complete address of a resource on the web. This complete
address comprises the protocol used, server name, path name and file name.
For example, http://www.tutorialspoint.com/internet_technology/index.htm, where:

http is the protocol.

tutorialspoint.com is the server name.

index.htm is the file name.
The protocol part tells the web browser how to handle the file. Other
protocols that can be used to create a URL include:

FTP

https

Gopher

mailto

news
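The parts of an absolute URL described above can be pulled apart with urlsplit from Python's standard urllib.parse module. This is a small sketch using the example URL from the text; the variable names are chosen for this example only.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.tutorialspoint.com/internet_technology/index.htm")

protocol = parts.scheme       # the protocol part, before "://"
server = parts.hostname       # the server name
# Split the path into the directory and the file name at the last slash.
directory, _, filename = parts.path.rpartition("/")
```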
RELATIVE URL
A relative URL is a partial address of a web page. Unlike an absolute URL, the
protocol and server parts are omitted from a relative URL.
Relative URLs are used for internal links, i.e. to create links to files that are part of the same
website as the web page on which you are placing the link.
For example, to link an image on
tutorialspoint.com/internet_technology/internet_referemce_models, we can
use a relative URL of the form /internet_technologies/internet-osi_model.jpg.
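A browser turns a relative URL into an absolute one by combining it with the address of the page that contains the link. The sketch below shows this with urljoin from Python's standard library; the http:// prefix on the page address is an assumption, since the example in the text omits the protocol.

```python
from urllib.parse import urljoin

# Address of the page containing the link (the scheme is assumed for the example).
page = "http://www.tutorialspoint.com/internet_technology/internet_referemce_models"

# A relative URL starting with "/" is resolved against the server root.
absolute = urljoin(page, "/internet_technologies/internet-osi_model.jpg")

# A relative URL without a leading "/" is resolved against the page's directory.
sibling = urljoin(page, "internet-osi_model.jpg")
```

In both cases the protocol and server name are taken from the page address, which is why relative links keep working when a site moves to another server.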
Difference between Absolute and Relative URL
Absolute URL                                   Relative URL
Used to link web pages on different websites.  Used to link web pages within the same website.
Difficult to manage.                           Easy to manage.
Changes when the server name or directory      Remains the same even if we change the server
name changes.                                  name or directory name.
Takes time to access.                          Comparatively faster to access.
Domain Name System Architecture
The Domain Name System comprises domain names, the domain name
space and name servers, which are described below:
Domain Names
A domain name is a symbolic string associated with an IP address. There are
several domain names available; some of them are generic, such as com,
edu, gov and net, while others are country-level domain names such as au, in,
za and us.
The following table shows the Generic Top-Level Domain names:
Domain Name   Meaning
com           Commercial business
edu           Education
gov           U.S. government agency
int           International entity
mil           U.S. military
net           Networking organization
org           Non-profit organization
The following table shows the Country top-level domain names:
Domain Name   Meaning
au            Australia
in            India
cl            Chile
fr            France
us            United States
za            South Africa
uk            United Kingdom
jp            Japan
es            Spain
de            Germany
ca            Canada
ee            Estonia
hk            Hong Kong
Domain Name Space
The domain name space refers to a hierarchy in the internet naming structure.
This hierarchy has multiple levels (from 0 to 127), with a root at the top. The
following diagram shows the domain name space hierarchy:
In the above diagram each subtree represents a domain. Each domain can
be partitioned into sub domains and these can be further partitioned and so
on.
Name Server
A name server contains the DNS database. This database comprises various
names and their corresponding IP addresses. Since it is not possible for a
single server to maintain the entire DNS database, the information is
distributed among many DNS servers.

The hierarchy of servers is the same as the hierarchy of names.

The entire name space is divided into zones.
Zones
A zone is a collection of nodes (subdomains) under the main domain. The server
maintains a database called a zone file for every zone.
If the domain is not further divided into subdomains, then domain and zone refer to
the same thing.
The information about the nodes in a subdomain is stored in servers
at the lower levels; however, the original server keeps references to these
lower-level servers.
TYPES OF NAME SERVERS
The following are the three categories of name servers that manage the entire
Domain Name System:
1. Root Server
2. Primary Server
3. Secondary Server
ROOT SERVER
The root server is the top-level server, responsible for the root of the entire DNS tree. It
does not contain information about individual domains but delegates authority
to other servers.
PRIMARY SERVER
A primary server stores a file about its zone. It has the authority to create,
maintain and update the zone file.
SECONDARY SERVER
A secondary server transfers the complete information about a zone from another
server, which may be a primary or secondary server. The secondary server does
not have the authority to create or update a zone file.
DNS Working
DNS translates a domain name into an IP address automatically. The following
steps walk through the domain resolution process:

When we type www.tutorialspoint.com into the browser, it asks the local DNS
server for its IP address.
Here the local DNS server is at the ISP end.

When the local DNS server does not find the IP address of the requested domain name, it
forwards the request to the root DNS server and asks it for the IP
address.

The root DNS server replies with a delegation: it does not know the IP address
of www.tutorialspoint.com, but it knows the IP address of the com DNS server.

The local DNS server then asks the com DNS server the same question.

The com DNS server replies in the same way: it does not know the IP address of
www.tutorialspoint.com, but it knows the address of the tutorialspoint.com DNS server.

Then the local DNS server asks the tutorialspoint.com DNS server the same question.

The tutorialspoint.com DNS server replies with the IP address of
www.tutorialspoint.com.

Finally, the local DNS server sends the IP address of www.tutorialspoint.com to the
computer that sent the request.
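The chain of referrals in these steps can be modelled with a small in-memory simulation. This is only a sketch of the idea, not a real resolver: the server names, the table layout and the address 192.0.2.10 (taken from a range reserved for documentation) are all made up for illustration, and no network traffic is involved.

```python
# Each hypothetical server either holds the answer or refers us to another server.
SERVERS = {
    "root": {"referrals": {"com": "com-server"}, "records": {}},
    "com-server": {"referrals": {"tutorialspoint.com": "tp-server"}, "records": {}},
    "tp-server": {"referrals": {}, "records": {"www.tutorialspoint.com": "192.0.2.10"}},
}


def resolve(name: str) -> tuple[str, list[str]]:
    """Start at the root and follow referrals, as the local DNS server does."""
    server, asked = "root", []
    while True:
        asked.append(server)
        zone = SERVERS[server]
        if name in zone["records"]:
            return zone["records"][name], asked
        # Follow the referral whose domain is a suffix of the requested name.
        for domain, next_server in zone["referrals"].items():
            if name == domain or name.endswith("." + domain):
                server = next_server
                break
        else:
            raise LookupError(f"{name} cannot be resolved")


address, path = resolve("www.tutorialspoint.com")
```

Here path records which servers were asked, in order: the root server, then the com server, then the server for tutorialspoint.com.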
Internet Services
Internet services allow us to access huge amounts of information such as
text, graphics, sound and software over the internet. The following diagram
shows the four different categories of Internet services.
Communication Services
Various communication services are available that allow the exchange of
information with individuals or groups. The following table gives a brief
introduction to these services:
S.N.  Service                    Description
1     Electronic Mail            Used to send electronic messages over the internet.
2     Telnet                     Used to log on to a remote computer that is attached to the internet.
3     Newsgroup                  Offers a forum for people to discuss topics of common interest.
4     Internet Relay Chat (IRC)  Allows people from all over the world to communicate in real time.
5     Mailing Lists              Used to organize groups of internet users to share common information through e-mail.
6     Internet Telephony (VoIP)  Allows internet users to talk across the internet to any PC equipped to receive the call.
7     Instant Messaging          Offers real-time chat between individuals and groups of people, e.g. Yahoo Messenger, MSN Messenger.
Information Retrieval Services
Several information retrieval services offer easy access to
information present on the internet. The following table gives a brief
introduction to these services:
S.N.  Service                                  Description
1     File Transfer Protocol (FTP)             Enables users to transfer files.
2     Archie                                   An updated database of public FTP sites and their content. It helps to search for a file by its name.
3     Gopher                                   Used to search, retrieve and display documents on remote sites.
4     Very Easy Rodent-Oriented Net-wide       VERONICA is a Gopher-based resource. It allows access to the information
      Index to Computerized Archives           resources stored on Gopher servers.
      (VERONICA)
Web Services
Web services allow the exchange of information between applications on the web.
Using web services, applications can easily interact with each other.
Web services are offered using the concept of utility computing.
World Wide Web (WWW)
The WWW is also known as W3. It offers a way to access documents spread over
several servers on the internet. These documents may contain text,
graphics, audio, video and hyperlinks. The hyperlinks allow users to navigate
between the documents.
Video Conferencing
Video conferencing, or video teleconferencing, is a method of communicating
by two-way video and audio transmission with the help of telecommunication
technologies.
Modes of Video Conferencing
POINT-TO-POINT
This mode of conferencing connects two locations only.
MULTI-POINT
This mode of conferencing connects more than two locations through a Multipoint Control Unit (MCU).
Internet Protocols
Transmission Control Protocol (TCP)
TCP is a connection-oriented protocol and offers end-to-end packet delivery.
It acts as the backbone for a connection. It exhibits the following key features:

Transmission Control Protocol (TCP) corresponds to the Transport Layer of the OSI
Model.

TCP is a reliable and connection-oriented protocol.

TCP offers:
o Stream data transfer.
o Reliability.
o Efficient flow control.
o Full-duplex operation.
o Multiplexing.

TCP offers connection-oriented end-to-end packet delivery.

TCP ensures reliability by sequencing bytes and using a forward acknowledgement
number, which tells the sender the next byte the receiver expects to
receive.

It retransmits bytes that are not acknowledged within a specified time period.
TCP Services
TCP offers the following services to processes at the application layer:

Stream Delivery Service

Sending and Receiving Buffers

Bytes and Segments

Full Duplex Service

Connection Oriented Service

Reliable Service
STREAM DELIVERY SERVICE
TCP is stream oriented because it allows the sending process to send
data as a stream of bytes and the receiving process to obtain data as a stream
of bytes.
SENDING AND RECEIVING BUFFERS
The sending and receiving processes may not produce and
consume data at the same speed; therefore, TCP needs buffers for storage at
the sending and receiving ends.
BYTES AND SEGMENTS
At the transport layer, TCP groups bytes
into a packet called a segment. Before transmission,
each segment is encapsulated in an IP datagram.
FULL DUPLEX SERVICE
Transmitting data in full-duplex mode means that data flows in both
directions at the same time.
CONNECTION ORIENTED SERVICE
TCP offers connection-oriented service in the following manner:
1. The TCP of process 1 informs the TCP of process 2 and gets its approval.
2. The TCP of process 1 and the TCP of process 2 exchange data in both
directions.
3. After completing the data exchange, when the buffers on both sides are empty, the
two TCPs destroy their buffers.
RELIABLE SERVICE
For the sake of reliability, TCP uses an acknowledgement mechanism.
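The connection-oriented, stream-based behaviour described above can be observed with Python's standard socket module. The sketch below runs a tiny echo server and a client on the same machine; the loopback address, the choice of an OS-assigned port and the echo behaviour are assumptions made for the example. The operating system's TCP implementation handles segments, acknowledgements and retransmission behind the scenes.

```python
import socket
import threading


def echo_once(server: socket.socket) -> None:
    """Accept one connection and echo data back until the client stops sending."""
    connection, _ = server.accept()
    with connection:
        while chunk := connection.recv(1024):
            connection.sendall(chunk)


# Listen on the loopback interface; port 0 asks the OS for any free port.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
worker = threading.Thread(target=echo_once, args=(server,))
worker.start()

# Step 1: establish the connection. Step 2: exchange data in both directions.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello ")
    client.sendall(b"world")
    client.shutdown(socket.SHUT_WR)   # tell the peer we have finished sending
    reply = b""
    while chunk := client.recv(1024):
        reply += chunk

# Step 3: both sides close the connection, releasing their buffers.
worker.join()
server.close()
```

The two separate sendall calls come back as one continuous run of bytes, which illustrates why TCP is described as a stream service rather than a message service.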
Internet Protocol (IP)
The Internet Protocol is a connectionless and unreliable protocol. It provides no
guarantee of successful transmission of data.
To make communication reliable, it must be paired with a reliable protocol such as
TCP at the transport layer.
The Internet Protocol transmits data in the form of a datagram, as shown in the
following diagram:
Points to remember:

The length of a datagram is variable.

The datagram is divided into two parts: header and data.

The length of the header is 20 to 60 bytes.

The header contains information for routing and delivery of the packet.
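The 20 to 60 byte range follows from the IPv4 header length field, which counts the header in 4-byte words and can hold values from 5 to 15. The sketch below builds a sample 20-byte IPv4 header with the struct module and reads the length back; every field value, including the two addresses from a documentation-only range, is illustrative, and the checksum is simply left as zero.

```python
import socket
import struct

# Build a sample 20-byte IPv4 header (all field values here are illustrative).
version_ihl = (4 << 4) | 5             # version 4, header length 5 words = 20 bytes
sample = struct.pack(
    "!BBHHHBBH4s4s",
    version_ihl, 0, 40,                # version/IHL, type of service, total length
    1, 0,                              # identification, flags + fragment offset
    64, 6, 0,                          # time to live, protocol (6 = TCP), checksum left as 0
    socket.inet_aton("192.0.2.1"),     # source address
    socket.inet_aton("192.0.2.2"),     # destination address
)


def header_length(datagram: bytes) -> int:
    """Return the header length in bytes from the IHL field (low 4 bits of byte 0)."""
    return (datagram[0] & 0x0F) * 4


def parse_addresses(datagram: bytes) -> tuple[str, str]:
    """Return the source and destination addresses stored in bytes 12 to 19."""
    source, destination = struct.unpack("!4s4s", datagram[12:20])
    return socket.inet_ntoa(source), socket.inet_ntoa(destination)
```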
User Datagram Protocol (UDP)
Like IP, UDP is a connectionless and unreliable protocol. It doesn't require
making a connection with the host to exchange data. Since UDP is an unreliable
protocol, there is no mechanism for ensuring that the data sent is received.
UDP transmits data in the form of a datagram. The UDP datagram consists of
five parts, as shown in the following diagram:
Points to remember:

UDP is used by applications that typically transmit small amounts of data at one
time.

UDP provides protocol ports, i.e. a UDP message contains both source and
destination port numbers, which makes it possible for UDP software at the
destination to deliver the message to the correct application program.
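Assuming the five parts are the standard UDP fields (source port, destination port, length, checksum and data), the layout can be sketched with the struct module. The port numbers and payload below are arbitrary, and the checksum is left as zero, which UDP over IPv4 interprets as "no checksum computed".

```python
import struct


def build_udp_datagram(source_port: int, destination_port: int, data: bytes) -> bytes:
    """Prefix the data with the 8-byte UDP header."""
    length = 8 + len(data)          # header plus payload, in bytes
    checksum = 0                    # not computed in this sketch
    header = struct.pack("!HHHH", source_port, destination_port, length, checksum)
    return header + data


def parse_udp_datagram(datagram: bytes) -> dict:
    """Split a datagram back into its five parts."""
    source_port, destination_port, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {
        "source_port": source_port,
        "destination_port": destination_port,
        "length": length,
        "checksum": checksum,
        "data": datagram[8:length],
    }


datagram = build_udp_datagram(50000, 69, b"hello")
fields = parse_udp_datagram(datagram)
```

The destination port in the header is what lets the receiving UDP software hand the data to the right application.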
File Transfer Protocol (FTP)
FTP is used to copy files from one host to another. It does so
in the following manner:

FTP creates two processes, a control process and a data transfer process, at
both ends, i.e. at the client as well as at the server.

FTP establishes two different connections: one for data transfer and the other for
control information.

The control connection is made between the control processes, while the data
connection is made between the data transfer processes.

FTP uses port 21 for the control connection and port 20 for the data connection.
Trivial File Transfer Protocol (TFTP)
Trivial File Transfer Protocol is also used to transfer files, but it
transfers them without authentication. Unlike FTP, TFTP does not separate
control and data information. Since there is no authentication, TFTP
lacks security features and is therefore not recommended.
Key points

TFTP makes use of UDP for data transport. Each TFTP message is carried in a
separate UDP datagram.

The first two bytes of a TFTP message specify the type of message.

A TFTP session is initiated when a TFTP client sends a request to upload or
download a file.

The request is sent from an ephemeral UDP port to UDP port 69 of a TFTP
server.
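The request that starts a TFTP session is a short message whose first two bytes carry the message type. A minimal sketch of building such a request is shown below, using the opcode values and layout defined in RFC 1350; the file name boot.cfg is a made-up example, and the code only constructs the bytes without sending them.

```python
import struct

OPCODE_RRQ = 1   # read request (download), per RFC 1350
OPCODE_WRQ = 2   # write request (upload), per RFC 1350


def build_request(opcode: int, filename: str, mode: str = "octet") -> bytes:
    """Opcode (2 bytes), then the file name and transfer mode, each NUL-terminated."""
    return (
        struct.pack("!H", opcode)
        + filename.encode("ascii") + b"\x00"
        + mode.encode("ascii") + b"\x00"
    )


def message_type(message: bytes) -> int:
    """Read the message type from the first two bytes."""
    return struct.unpack("!H", message[:2])[0]


request = build_request(OPCODE_RRQ, "boot.cfg")
```

A client would place these bytes in a single UDP datagram addressed to port 69 of the server.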
Difference between FTP and TFTP
S.N.  Parameter         FTP                      TFTP
1     Operation         Transferring files       Transferring files
2     Authentication    Yes                      No
3     Protocol          TCP                      UDP
4     Ports             21 (control), 20 (data)  69 (initial request)
5     Control and Data  Separated                Not separated
6     Data Transfer     Reliable                 Unreliable
Telnet
Telnet is a protocol used to log in to a remote computer on the internet. There
are a number of Telnet clients with user-friendly interfaces. The
following diagram shows a person logged in to computer A who, from there,
logs in remotely to computer B.
Hyper Text Transfer Protocol (HTTP)
HTTP is a communication protocol. It defines the mechanism for communication
between the browser and the web server. It is also called a request and response
protocol because the communication between browser and server takes place
in request and response pairs.
HTTP Request
An HTTP request comprises lines which contain:

Request line

Header fields

Message body
Key Points

The first line, i.e. the request line, specifies the request method, e.g. GET or POST.

The second line specifies a header which indicates the domain name of the
server from which index.htm is retrieved.
HTTP Response
Like an HTTP request, an HTTP response also has a certain structure. An HTTP response
contains:

Status line

Headers

Message body
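The three-part structure of requests and responses can be seen by splitting sample messages by hand. The two messages below are illustrative only (the host name, path, headers and body are assumptions chosen to match the index.htm example), and the helper is a simplified sketch rather than a complete HTTP parser.

```python
# A sample request for index.htm: request line, one header field, empty body.
request = (
    "GET /index.htm HTTP/1.1\r\n"
    "Host: www.tutorialspoint.com\r\n"
    "\r\n"
)

# A sample response: status line, headers, then the message body.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 12\r\n"
    "\r\n"
    "<p>Hello</p>"
)


def split_message(message: str) -> tuple[str, dict[str, str], str]:
    """Split a message into its first line, header fields and body."""
    head, _, body = message.partition("\r\n\r\n")
    first_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return first_line, headers, body


request_line, request_headers, _ = split_message(request)
method, target, version = request_line.split(" ")

status_line, response_headers, body = split_message(response)
_, status_code, reason = status_line.split(" ", 2)
```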
E-mail Protocols
E-mail protocols are sets of rules that help the client to properly transmit
information to or from the mail server. In this tutorial, we will discuss
protocols such as SMTP, POP and IMAP.
SMTP
SMTP stands for Simple Mail Transfer Protocol. It was first proposed in
1982. It is a standard protocol used for sending e-mail efficiently and reliably
over the internet.
Key Points:

SMTP is an application-level protocol.

SMTP is a connection-oriented protocol.

SMTP is a text-based protocol.

It handles the exchange of messages between e-mail servers over a TCP/IP network.

Apart from transferring e-mail, SMTP also provides notification regarding incoming
mail.

When you send e-mail, your e-mail client sends it to your e-mail server, which
then contacts the recipient's mail server using an SMTP client.

The SMTP commands specify the sender's and receiver's e-mail addresses, along
with the message to be sent.

The exchange of commands between servers is carried out without the intervention
of any user.

If a message cannot be delivered, an error report is sent to the sender, which
makes SMTP a reliable protocol.
SMTP Commands
The following table describes some of the SMTP commands:
S.N.  Command    Description
1     HELO       This command initiates the SMTP conversation.
2     EHLO       This is an alternative command to initiate the conversation. It indicates that the sending server wants to use the extended SMTP (ESMTP) protocol.
3     MAIL FROM  This indicates the sender's address.
4     RCPT TO    It identifies the recipient of the mail. To deliver the same message to multiple users, this command can be repeated multiple times.
5     SIZE       This lets the server know the size of the attached message in bytes.
6     DATA       The DATA command signifies that a stream of data will follow. Here the stream of data refers to the body of the message.
7     QUIT       This command is used to terminate the SMTP connection.
8     VRFY       This command is used by the receiving server to verify whether the given username is valid or not.
9     EXPN       It is the same as VRFY, except that it lists all the user names when used with a distribution list.
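To show how these commands fit together, the sketch below assembles the sequence a sending server might issue for one message. It only builds the command strings and does not talk to a real server; the helper name, the client name and the example.com addresses are all hypothetical, and server replies are left out.

```python
def smtp_commands(sender: str, recipients: list[str], body: str,
                  client_name: str = "client.example.com") -> list[str]:
    """Return the commands, in order, for delivering one message."""
    commands = [f"HELO {client_name}", f"MAIL FROM:<{sender}>"]
    # RCPT TO is repeated once for every recipient.
    commands += [f"RCPT TO:<{recipient}>" for recipient in recipients]
    # After DATA, the message text is sent and ended by a line holding a single dot.
    commands += ["DATA", body + "\r\n.", "QUIT"]
    return commands


dialogue = smtp_commands(
    "alice@example.com",
    ["bob@example.com", "carol@example.com"],
    "Subject: Hello\r\n\r\nSee you at noon.",
)
```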
IMAP
IMAP stands for Internet Message Access Protocol. It was first proposed in
1986. There are five versions of IMAP, as follows:
1. Original IMAP
2. IMAP2
3. IMAP3
4. IMAP2bis
5. IMAP4
Key Points:

IMAP allows the client program to manipulate e-mail messages on the server
without downloading them to the local computer.

The e-mail is held and maintained by the remote server.

It enables us to take actions such as downloading or deleting mail without
reading it. It also enables us to create, manipulate and delete remote message
folders called mailboxes.

IMAP enables users to search e-mails.

It allows concurrent access to multiple mailboxes on multiple mail servers.
IMAP Commands
The following table describes some of the IMAP commands:
S.N.  Command     Description
1     LOGIN       This command identifies the user to the server and opens the session.
2     CAPABILITY  This command requests a listing of the capabilities that the server supports.
3     NOOP        This command is used as a periodic poll for new messages or message status updates during a period of inactivity.
4     SELECT      This command selects a mailbox so that its messages can be accessed.
5     EXAMINE     It is the same as the SELECT command, except that no change to the mailbox is permitted.
6     CREATE      It is used to create a mailbox with a specified name.
7     DELETE      It is used to permanently delete a mailbox with a given name.
8     RENAME      It is used to change the name of a mailbox.
9     LOGOUT      This command informs the server that the client is done with the session. The server must send an untagged BYE response before the OK response and then close the network connection.
POP
POP stands for Post Office Protocol. It is generally used to support a single
client. There are several versions of POP, but POP3 is the current
standard.
Key Points

POP is an application-layer internet standard protocol.

Since POP supports offline access to messages, it requires less internet
usage time.

POP does not provide a search facility.

In order to access messages, it is necessary to download them.

It allows only one mailbox to be created on the server.

It is not suitable for accessing non-mail data.

POP commands are generally abbreviated into codes of three or four letters, e.g.
STAT.
POP Commands
The following table describes some of the POP commands:
S.N.  Command     Description
1     USER, PASS  These commands identify the user and open the session.
2     STAT        It is used to display the number of messages currently in the mailbox.
3     LIST        It is used to get a summary of the messages, with one summary line shown per message.
4     RETR        This command is used to retrieve a message from the mailbox.
5     DELE        It is used to delete a message.
6     RSET        It is used to reset the session to its initial state.
7     QUIT        It is used to log off the session.
Comparison between POP and IMAP
S.N.  POP                                             IMAP
1     Generally used to support a single client.      Designed to handle multiple clients.
2     Messages are accessed offline.                  Messages are accessed online, although it also
                                                      supports offline mode.
3     POP does not provide a search facility.         It offers the ability to search e-mails.
4     All the messages have to be downloaded.         It allows selective transfer of messages to the client.
5     Only one mailbox can be created on the server.  Multiple mailboxes can be created on the server.
6     Not suitable for accessing non-mail data.       Suitable for accessing non-mail data, i.e. attachments.
7     POP commands are generally abbreviated into     IMAP commands are not abbreviated; they are
      codes of three or four letters, e.g. STAT.      written in full, e.g. STATUS.
8     It requires minimum use of server resources.    Clients are totally dependent on the server.
9     Mails once downloaded cannot be accessed        Allows mails to be accessed from multiple
      from some other location.                       locations.
10    The e-mails are not downloaded automatically.   Users can view the headings and sender of e-mails
                                                      and then decide to download.
11    POP requires less internet usage time.          IMAP requires more internet usage time.
Markup language
From Wikipedia, the free encyclopedia
Example of RecipeBook, a simple markup language based on XML for creating recipes. The markup can be
converted to HTML, PDF and Rich Text Format using a programming language or XSL.
A markup language is a system for annotating a document in a way that is syntactically
distinguishable from the text.[1] The idea and terminology evolved from the "marking up" of paper
manuscripts, i.e., the revision instructions by editors, traditionally written with a blue pencil on
authors' manuscripts.
In digital media this "blue pencil instruction text" was replaced by tags, that is, instructions are
expressed directly by tags or "instruction text encapsulated by tags." Examples include typesetting
instructions such as those found in troff, TeX and LaTeX, or structural markers such as XML tags.
Markup instructs the software that displays the text to carry out appropriate actions, but is omitted
from the version of the text that users see.
Some markup languages, such as the widely used HTML, have pre-defined presentation
semantics—meaning that their specification prescribes how to present the structured data. Others,
such as XML, do not have them and are general purpose.
HyperText Markup Language (HTML), one of the document formats of the World Wide Web, is an
instance of SGML (though, strictly, it does not comply with all the rules of SGML), and follows many
of the markup conventions used in the publishing industry in the communication of printed work
between authors, editors, and printers.
Types
There are three main general categories of electronic markup:[2][3]
Presentational markup
The kind of markup used by traditional word-processing systems: binary codes embedded
within document text that produce the WYSIWYG effect. Such markup is usually hidden from
human users, even authors or editors.
Procedural markup
Markup is embedded in text and provides instructions for programs that are to process the
text. Well-known examples include troff, TeX, and PostScript. It is expected that the
processor will run through the text from beginning to end, following the instructions as
encountered. Text with such markup is often edited with the markup visible and directly
manipulated by the author. Popular procedural-markup systems usually include
programming constructs, so macros or subroutines can be defined and invoked by name.
Descriptive markup
Markup is used to label parts of the document rather than to provide specific instructions as
to how they should be processed. Well-known examples include LaTeX, HTML, and XML.
The objective is to decouple the inherent structure of the document from any particular
treatment or rendition of it. Such markup is often described as "semantic". An example of
descriptive markup would be HTML's <cite> tag, which is used to label a citation.
Descriptive markup—sometimes called logical markup or conceptual markup—encourages
authors to write in a way that describes the material conceptually, rather than visually.[4]
There is considerable blurring of the lines between the types of markup. In modern
word-processing systems, presentational markup is often saved in descriptive-markup-oriented
systems such as XML, and then processed procedurally by implementations.
The programming constructs in procedural-markup systems such as TeX may be used
to create higher-level markup systems that are more descriptive, such as LaTeX.
In recent years, a number of small and largely unstandardized markup languages have
been developed to allow authors to create formatted text via web browsers, for use
in wikis and web forums. These are sometimes called lightweight markup
languages. Markdown or the markup language used by Wikipedia are examples of
such wiki markup.
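The separation that descriptive markup creates between structure and rendition can be illustrated with a short Python sketch. The XML below is a made-up recipe document (the element names are assumptions, not the actual RecipeBook vocabulary), and the two functions are hypothetical renderers showing that the same labelled structure can be presented in different ways.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: the tags say what each part is, not how it should look.
document = """
<recipe>
  <title>Pancakes</title>
  <ingredient>flour</ingredient>
  <ingredient>milk</ingredient>
</recipe>
""".strip()

recipe = ET.fromstring(document)


def to_html(recipe: ET.Element) -> str:
    """Render the recipe as an HTML heading and bullet list."""
    items = "".join(f"<li>{item.text}</li>" for item in recipe.findall("ingredient"))
    return f"<h1>{recipe.findtext('title')}</h1><ul>{items}</ul>"


def to_text(recipe: ET.Element) -> str:
    """Render the same recipe as plain text."""
    lines = [recipe.findtext("title").upper()]
    lines += [f"- {item.text}" for item in recipe.findall("ingredient")]
    return "\n".join(lines)
```

Neither rendition is stored in the document itself; each is decided by the program that processes the markup.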
History
Etymology and origin
The term markup is derived from the traditional publishing practice of "marking
up" a manuscript, which involves adding handwritten annotations in the form of
conventional symbolic printer's instructions in the margins and text of a paper
manuscript or printed proof. For centuries, this task was done primarily by skilled
typographers known as "markup men"[5] or "copy markers"[6] who marked up text to
indicate what typeface, style, and size should be applied to each part, and then passed
the manuscript to others for typesetting by hand. Markup was also commonly applied by
editors, proofreaders, publishers, and graphic designers, and indeed by document
authors.
GenCode
The first well-known public presentation of markup languages in computer text
processing was made by William W. Tunnicliffe at a conference in 1967, although he
preferred to call it generic coding. It can be seen as a response to the emergence of
programs such as RUNOFF that each used their own control notations, often specific to
the target typesetting device. In the 1970s, Tunnicliffe led the development of a standard
called GenCode for the publishing industry and later was the first chair of
the International Organization for Standardization committee that created SGML, the
first standard descriptive markup language. Book designer Stanley Rice published
speculation along similar lines in 1970.[7] Brian Reid, in his 1980 dissertation at Carnegie
Mellon University, developed the theory and a working implementation of descriptive
markup in actual use.
However, IBM researcher Charles Goldfarb is more commonly seen today as the
"father" of markup languages. Goldfarb hit upon the basic idea while working on a
primitive document management system intended for law firms in 1969, and helped
invent IBM GML later that same year. GML was first publicly disclosed in 1973.
In 1975, Goldfarb moved from Cambridge, Massachusetts to Silicon Valley and became
a product planner at the IBM Almaden Research Center. There, he convinced IBM's
executives to deploy GML commercially in 1978 as part of IBM's Document Composition
Facility product, and it was widely used in business within a few years.
SGML, which was based on both GML and GenCode, began as a standardization project
that Goldfarb worked on from 1974.[8] Goldfarb eventually became chair of the SGML
committee. SGML was first released by ISO as the ISO 8879 standard in October 1986.
troff and nroff
Some early examples of computer markup languages available outside the publishing
industry can be found in typesetting tools on Unix systems such as troff and nroff. In
these systems, formatting commands were inserted into the document text so that
typesetting software could format the text according to the editor's specifications. Getting
a document printed correctly was an iterative, trial-and-error process.[9] Availability
of WYSIWYG ("what you see is what you get") publishing software supplanted much
use of these languages among casual users, though serious publishing work still uses
markup to specify the non-visual structure of texts, and WYSIWYG editors now usually
save documents in a markup-language-based format.
TeX
Another major publishing standard is TeX, created and refined by Donald Knuth in the
1970s and '80s. TeX concentrated on detailed layout of text and font descriptions to
typeset mathematical books. This required Knuth to spend considerable time
investigating the art of typesetting. TeX is mainly used in academia, where it is a de
facto standard in many scientific disciplines. A TeX macro package known
as LaTeX provides a descriptive markup system on top of TeX, and is widely used.
Scribe, GML and SGML
The first language to make a clean distinction between structure and presentation
was Scribe, developed by Brian Reid and described in his doctoral thesis in
1980.[10] Scribe was revolutionary in a number of ways, not least that it introduced the
idea of styles separated from the marked up document, and of a grammar controlling
the usage of descriptive elements. Scribe influenced the development of Generalized
Markup Language (later SGML) and is a direct ancestor to HTML and LaTeX.
In the early 1980s, the idea that markup should be focused on the structural aspects of
a document and leave the visual presentation of that structure to the interpreter led to
the creation of SGML. The language was developed by a committee chaired by
Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's
project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key
members of the SGML committee.
SGML specified a syntax for including the markup in documents, as well as one for
separately describing what tags were allowed, and where (the Document Type
Definition (DTD) or schema). This allowed authors to create and use any markup they
wished, selecting tags that made the most sense to them and were named in their own
natural languages. Thus, SGML is properly a meta-language, and many particular
markup languages are derived from it. From the late '80s on, most substantial new
markup languages have been based on the SGML system, including, for
example, TEI and DocBook. SGML was promulgated as an International Standard
by the International Organization for Standardization, as ISO 8879, in 1986.
SGML found wide acceptance and use in fields with very large-scale documentation
requirements. However, many found it cumbersome and difficult to learn—a side effect
of its design attempting to do too much and be too flexible. For example, SGML made
end tags (or start-tags, or even both) optional in certain contexts, because its
developers thought markup would be done manually by overworked support staff who
would appreciate saving keystrokes.
HTML
In 1989, physicist Sir Tim Berners-Lee wrote a memo proposing an Internet-based
hypertext system,[11] then specified HTML and wrote the browser and server
software in the last part of 1990. The first publicly available description of HTML was a
document called "HTML Tags", first mentioned on the Internet by Berners-Lee in late
1991.[12][13] It describes 18 elements comprising the initial, relatively simple design of
HTML. Except for the hyperlink tag, these were strongly influenced by SGMLguid, an
in-house SGML-based documentation format at CERN. Eleven of these elements still exist
in HTML 4.[14]
Berners-Lee considered HTML an SGML application. The Internet Engineering Task
Force (IETF) formally defined it as such with the mid-1993 publication of the first
proposal for an HTML specification: "Hypertext Markup Language (HTML)" Internet-Draft
by Berners-Lee and Dan Connolly, which included an SGML Document Type
Definition to define the grammar.[15] Many of the HTML text elements are found in the
1988 ISO technical report TR 9537 Techniques for using SGML, which in turn covers
the features of early text formatting languages such as that used by the RUNOFF
command developed in the early 1960s for the CTSS (Compatible Time-Sharing
System) operating system. These formatting commands were derived from those used
by typesetters to manually format documents. Steven DeRose[16] argues that HTML's
use of descriptive markup (and influence of SGML in particular) was a major factor in
the success of the Web, because of the flexibility and extensibility that it enabled. HTML
became the main markup language for creating web pages and other information that
can be displayed in a web browser, and is quite likely the most used markup language
in the world today.
XML
XML (Extensible Markup Language) is a meta markup language that is now widely
used. XML was developed by the World Wide Web Consortium, in a committee created
and chaired by Jon Bosak. The main purpose of XML was to simplify SGML by focusing
on a particular problem—documents on the Internet.[17] XML remains a meta-language
like SGML, allowing users to create any tags needed (hence "extensible") and then
describing those tags and their permitted uses.
XML adoption was helped because every XML document can be written in such a way
that it is also an SGML document, and existing SGML users and software could switch
to XML fairly easily. XML eliminated many of the more complex and human-oriented
features of SGML to simplify implementation in environments such as documents
and publications. Even so, it appeared to strike a happy medium between simplicity and
flexibility, and was rapidly adopted for many other uses. XML is now widely used for
communicating data between applications.
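To make the "extensible" idea concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The vocabulary (the family and member tags and the name attribute) is invented for this example and is not part of any standard. The parser only checks that the document is well-formed; it does not validate the tags against a DTD or schema describing their permitted uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these tag and attribute names are made up
# for illustration; XML itself does not define them.
DOC = """<family name="Anatidae">
  <member>ducks</member>
  <member>geese</member>
  <member>swans</member>
</family>"""

root = ET.fromstring(DOC)  # raises ParseError if DOC is not well-formed
members = [m.text for m in root.findall("member")]

print(root.tag, root.get("name"), members)
# prints: family Anatidae ['ducks', 'geese', 'swans']
```

The same structured data could be exchanged between two applications that agree on this vocabulary, which is the kind of use the paragraph above describes.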
XHTML
From January 2000 until the release of HTML5, all W3C Recommendations for HTML were
based on XML rather than SGML, using the abbreviation XHTML (Extensible HyperText
Markup Language). The language
specification requires that XHTML Web documents must be well-formed XML
documents. This allows for more rigorous and robust documents while using tags
familiar from HTML.
One of the most noticeable differences between HTML and XHTML is the rule that all
tags must be closed: empty HTML tags such as <br> must either be closed with a
regular end-tag, or replaced by a special form: <br /> (the space before the '/' at the
end of the tag is optional, but frequently used because it enables some pre-XML Web
browsers, and SGML parsers, to accept the tag). Another is that all attribute values in
tags must be quoted. Finally, all tag and attribute names within the XHTML namespace
must be lowercase to be valid. HTML, on the other hand, was case-insensitive.
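The first two rules above are XML well-formedness rules, so any XML parser can check them. The sketch below uses Python's standard xml.etree.ElementTree for that purpose; the helper name is_well_formed is hypothetical. It does not check the lowercase rule, which is a matter of validity against the XHTML vocabulary rather than well-formedness.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>one<br>two</p>"))          # False: <br> is never closed
print(is_well_formed("<p>one<br />two</p>"))        # True: empty element is closed
print(is_well_formed("<p class=intro>text</p>"))    # False: attribute value unquoted
print(is_well_formed('<p class="intro">text</p>'))  # True: attribute value quoted
```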
Other XML-based applications
Many XML-based applications now exist, including the Resource Description
Framework as RDF/XML, XForms, DocBook, SOAP, and the Web Ontology
Language (OWL). For a partial list of these, see List of XML markup languages.
Features
A common feature of many markup languages is that they intermix the text of a
document with markup instructions in the same data stream or file. This is not
necessary; it is possible to isolate markup from text content, using pointers, offsets, IDs,
or other methods to co-ordinate the two. Such "standoff markup" is typical for the
internal representations that programs use to work with marked-up documents.
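A minimal sketch of standoff markup in Python follows. The text is stored untouched, and the annotations live in a separate list of (start offset, end offset, tag) entries. The names TEXT, ANNOTATIONS, and to_inline are hypothetical, and the merge step assumes the annotations do not overlap.

```python
TEXT = "The family Anatidae includes ducks, geese, and swans."

# Standoff annotations: (start offset, end offset, tag), kept apart from TEXT.
start = TEXT.index("Anatidae")
ANNOTATIONS = [(start, start + len("Anatidae"), "i")]

def to_inline(text, annotations):
    """Merge non-overlapping standoff annotations into inline markup."""
    out, pos = [], 0
    for begin, end, tag in sorted(annotations):
        out.append(text[pos:begin])                      # unmarked text so far
        out.append(f"<{tag}>{text[begin:end]}</{tag}>")  # wrapped span
        pos = end
    out.append(text[pos:])                               # remaining tail
    return "".join(out)

print(to_inline(TEXT, ANNOTATIONS))
# prints: The family <i>Anatidae</i> includes ducks, geese, and swans.
```

Because the offsets point into the text rather than sitting inside it, several independent annotation layers can describe the same text without interfering with one another.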
However, embedded or "inline" markup is much more common elsewhere. Here, for
example, is a small section of text marked up in HTML:
<h1>Anatidae</h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely related screamers.
</p>
The codes enclosed in angle-brackets <like this> are markup instructions (known
as tags), while the text between these instructions is the actual text of the document.
The codes h1, p, and em are examples of semantic markup, in that they describe the
intended purpose or meaning of the text they include. Specifically, h1 means "this is a
first-level heading", p means "this is a paragraph", and em means "this is an
emphasized word or phrase". A program interpreting such structural markup may apply
its own rules or styles for presenting the various pieces of text, using different typefaces,
boldness, font size, indentation, colour, or other styles, as desired. A tag such as "h1"
(header level 1) might be presented in a large bold sans-serif typeface, for example, or
in a monospaced (typewriter-style) document it might be underscored – or it might not
change the presentation at all.
In contrast, the i tag in HTML is an example of presentational markup; it is generally
used to specify a particular characteristic of the text (in this case, the use of an italic
typeface) without specifying the reason for that appearance.
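The idea that an interpreting program applies its own presentation rules can be sketched with Python's standard html.parser module. The STYLES table below is a hypothetical set of plain-text rules chosen for this example (upper-case headings, asterisks around emphasis); a different program could map the same tags to entirely different styles, or to none.

```python
from html.parser import HTMLParser

# Hypothetical plain-text style rules: one of many ways a program
# might choose to present each semantic tag.
STYLES = {
    "h1": lambda s: s.upper() + "\n",
    "em": lambda s: f"*{s}*",
}

class PlainTextRenderer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.parts = []   # rendered output pieces

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Style text according to the innermost open tag, if a rule exists.
        style = STYLES.get(self.stack[-1]) if self.stack else None
        self.parts.append(style(data) if style else data)

def render(html: str) -> str:
    renderer = PlainTextRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.parts)

print(render("<h1>Anatidae</h1><p>but <em>not</em> the screamers.</p>"))
```

Swapping in a different STYLES table changes the presentation without touching the marked-up document, which is the separation the passage describes.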
The Text Encoding Initiative (TEI) has published extensive guidelines[18] for how to
encode texts of interest in the humanities and social sciences, developed through years
of international cooperative work. These guidelines are used by projects encoding
historical documents, the works of particular scholars, periods, or genres, and so on.