08. Worms, Viruses and Spam

Thomas-Ivo Heinen
Matriculation No.: 6055050
University of Paderborn
Organizer: Christian Schindelhauer
August 4, 2004
Abstract
This document addresses two of the biggest problems in IT today.
On the one hand there is the increasing plague of viruses and worms,
which is estimated to cost the European economy about 9 billion euro in
2004. As the battle between virus writers and anti-virus companies gets
harder, the means used become more technologically advanced. This
paper covers the basic mechanism behind worms and viruses, called
”exploiting”.
Nowadays, worms and viruses install backdoors on infected machines
and form huge remote-controllable networks which are then used for
massive spamming. According to [Wae03], the lost working time and the
load on infrastructure in Europe will cost about 9.2 billion euro this
year. The second part of this document describes the links between worm
writers and spammers and their basic approach to cooperation, as well as
the theory behind the spam filters currently in use.
1 Exploits and their prevention
So-called exploits are gates which open up systems to virus attacks. Basically
these are programming glitches which can be used to introduce malicious
code. This section details the term exploit and describes the currently
existing countermeasures.
1.1 Exploits
According to [Jar] an exploit is defined as follows:
”exploit n. [originally cracker slang] 1. A vulnerability in software that can be used for breaking security or otherwise attacking
an Internet host over the network.”
In particular, these are programming glitches where some circumstances were
not considered. For example, suppose a C/C++ program defines an array of 200
elements. If an attempt is made to write to an element beyond the end of the
array, the program becomes exploitable. So if some input handling asks the
user to enter a number from 1 to 200 to change an element of the array, and it
is not checked whether the entered number is in bounds, this can be used to
write to memory at positions where the program does not expect it.
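The array example can be made concrete with a small simulation. The sketch below is an illustration in Python, not actual C memory handling: it models memory as one flat list in which the 200-element array is followed directly by one unrelated cell. The names make_memory, unchecked_store and checked_store, and the neighbouring "flag" cell, are assumptions made for this example.

```python
ARRAY_SIZE = 200

def make_memory():
    # Cells 0..199 model the array; cell 200 models an unrelated
    # neighbouring variable (a flag) placed right behind it.
    return [0] * ARRAY_SIZE + [0]

def unchecked_store(memory, position, value):
    # The bug: the user-supplied 1-based position is trusted as-is.
    memory[position - 1] = value

def checked_store(memory, position, value):
    # The fix: reject every position outside 1..200 before writing.
    if not 1 <= position <= ARRAY_SIZE:
        raise ValueError("position out of bounds")
    memory[position - 1] = value

memory = make_memory()
unchecked_store(memory, 201, 1)            # one element past the array
flag_overwritten = memory[ARRAY_SIZE] == 1
```

With the unchecked variant, entering 201 silently changes the neighbouring cell; the checked variant refuses the same input.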
The exploits that virus and worm writers deal with are mostly bugs in programs
like Microsoft Outlook or in the operating system itself. As these consist of
millions of lines of code, there are plenty of opportunities to find an exploit.
Basically, the attacker injects executable code into the buggy program; this
code is then executed and makes it possible to copy the actual program (such
as a worm) onto the system.
It is of course crucial to avoid these exploitable bugs, and there are several
means to do so.
1.2 Software measures
Of course, making no errors in code would rule out all exploits, but for every
thousand lines of code there are an estimated 100 to 150 bugs. Even with extreme
caution, up to five undiscovered errors remain in this amount of code
[Kru03].
To minimize the probability of a program being exploited, several types
of software are available.
Quality assurance and lint tools
From the start, a programmer can keep the bug rate low by making it a habit
to check all external information such as user input or network communication.
But there may still be situations where such input is used without
checks or where an exact examination is not possible.
For this case there are various quality assurance tools, mostly called
”lint” tools. Beyond their original purpose of detecting style glitches, one
implementation called ”splint” (http://www.splint.org/) specializes in
recognizing dangerous parts of software written in the C programming language.
Tracing security concerns in software is a heavy task, and as the complexity of
the programming language increases, detecting problems gets harder. This is
why there are few such tools for more advanced languages.
Commercial CASE tools sometimes provide support for this, which is then called
”auditing”. One example of such a tool is Borland Together Control Center.
Compiler add-ons
Even if a program is already written, there can still be ways of securing
it. For a long time, tools like StackGuard and StackGhost have been available
which are invoked as preprocessors to compilers and modify dangerous code
parts so that they are less likely to be exploitable.
A related protection is the randomization of the address space (known as
Address Space Layout Randomization or ASLR). When the positions of code and
data in memory are randomized, it is much harder to guess which exact measures
have to be taken to exploit a program.
In addition, on every call of a subprogram a cookie (also called a canary) is
placed onto the stack [Lia04]. When the subprogram returns, a check is made
whether the cookie is still intact or was overwritten by some exploitation
technique. If it was overwritten, the whole application is terminated for
security reasons, which prevents the execution of probably malicious code.
Sadly, there are still ways to circumvent this kind of protection [Ric02].
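The cookie check can be illustrated with a simulated stack frame. The following is only a sketch in Python of the idea, not of how StackGuard or any compiler actually emits the check; the frame layout, the sizes and the function name call_with_canary are assumptions chosen for this example.

```python
import secrets
from typing import Optional

BUFFER_SIZE = 8
CANARY_SIZE = 4

def call_with_canary(user_input: bytes, canary: Optional[bytes] = None) -> str:
    """Model one subprogram call on a simulated stack frame."""
    if canary is None:
        canary = secrets.token_bytes(CANARY_SIZE)  # fresh random cookie
    return_address = b"RETN"
    # Assumed frame layout: [ buffer | canary | return address ]
    frame = bytearray(BUFFER_SIZE) + canary + return_address
    # Unchecked copy, as a vulnerable routine would do it: bytes beyond
    # BUFFER_SIZE spill into the canary and then the return address.
    n = min(len(user_input), len(frame))
    frame[:n] = user_input[:n]
    # Check on return: abort if the cookie is no longer intact.
    stored = bytes(frame[BUFFER_SIZE:BUFFER_SIZE + len(canary)])
    if stored != canary:
        return "aborted"
    return "returned"
```

Input that fits into the buffer leaves the cookie untouched, while input long enough to reach the return address has to cross the cookie first and is therefore detected.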
Virtual machines
Most exploits can be avoided by using a programming language which runs on a
virtual machine. A virtual machine is an abstraction of the actual computer,
either completely simulating a computer or separating the executed binary
code from the code the machine understands natively. In addition, virtual
machine-based languages automatically check the bounds of arrays to prevent
exploitable accesses (see bounds checking in Java [Mic00] and .NET).
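Automatic bounds checking can be observed in any such runtime. The sketch below uses Python, whose interpreter also executes bytecode on a virtual machine, as a stand-in for Java or .NET; the function name store is an assumption for this example. Note that Python additionally accepts negative indices, so only the upper bound is shown here.

```python
values = [0] * 200

def store(index: int, value: int) -> bool:
    """Return True if the write succeeded, False if the runtime refused it."""
    try:
        values[index] = value
        return True
    except IndexError:
        # The runtime itself detected the out-of-range access;
        # no neighbouring memory was touched.
        return False
```

Writing to index 199 succeeds, while index 200 is rejected by the runtime without any check written by the programmer.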
Operating system support
The most recent operating systems support some form of protection
against the injection of foreign code into applications. For example, Windows XP
Service Pack 2 does this in two ways [Bac].
On the one hand it supports the DEP/NX feature of 64 bit processors (see
below), and on the other hand the Service Pack was compiled with a special
configuration of Microsoft’s own C/C++ compiler which uses so-called ”cookies”
to identify possible exploitation (see Compiler add-ons above).
The problem with this feature is that it only protects the core functions of
Windows from being exploited. Software from other vendors does not benefit
unless it is compiled with the cookie feature, which is only present in the most
recent versions of Microsoft’s compiler and in special gcc versions (StackGuard,
SSP, ProPolice).
The same is true of the Linux PaX extension [Bus04], which is a patch
to be applied to the main Linux kernel in order to secure it. But because
of the more open infrastructure (the use of gcc is almost mandatory and free
of charge), it is very likely to be adopted throughout the whole community in
quite a short time. Merging of the PaX support is said to occur within the 2.6
line of the Linux kernel.
1.3 Hardware measures
Data Execution Prevention (DEP/NX/EVP)
Data Execution Prevention describes a technique to protect programs from
being exploited by marking memory regions as ”non-executable”. Due to the
architecture of this protection, it is only usable on computers with 64 bit wide
page table entries. The most current CPUs that allow DEP are the Intel Itanium
and the AMD Athlon64, both being true 64 bit processors.
It works by using a reserved bit in the page table entries to mark areas as
non-executable. As this bit is the highest bit of the 64 bit page table
entries, DEP is only usable in 64 bit mode. For 32 bit mode there is currently
only an implementation on servers which have the Physical Address Extension
(PAE) enabled.
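The role of the highest bit can be shown with plain integer arithmetic. The sketch below treats a page table entry as a 64 bit integer; the function names and the example entry value are assumptions for illustration, and no real page tables are touched.

```python
NX_BIT = 1 << 63  # highest bit of a 64 bit page table entry

def mark_non_executable(entry: int) -> int:
    """Return the entry with the no-execute bit set."""
    return entry | NX_BIT

def is_executable(entry: int) -> bool:
    """A page is executable as long as the no-execute bit is clear."""
    return (entry & NX_BIT) == 0

# Hypothetical entry: frame address 0x1000 with the "present" bit set.
entry = 0x1000 | 0x1
protected = mark_non_executable(entry)
```

The protected entry no longer fits into 32 bits, which is why classic 32 bit page table entries have no room for this flag.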
On 32 bit platforms, only the upcoming AMD Sempron processor will be able to
use DEP techniques, under the name Enhanced Virus Protection (EVP), which
is a bit misleading. As we have seen, DEP does not provide any anti-virus
functionality but only limits the possibility of using exploits. This support
on a 32 bit platform is possible because the Sempron processors are based on
the Athlon64 and are merely stripped of the 64 bit mode while retaining support
for 64 bit page tables.
Palladium and Trusted Computing
Another method of preventing the execution of viruses and worms is still in
preparation. The so-called Trusted Computing (TC) initiative is a group of big
hardware vendors which tries to address all issues of Digital Rights Management
(DRM), copy protection, confidentiality and security. In a TC-enabled computer
the operating system only executes verified programs and denies execution of
unsigned ones. Thus only applications which were certified by e.g. Microsoft
will be able to install and start up on a computer. As no worm or virus will
ever get a valid signature, their execution in a TC environment will be
impossible.
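The "execute only what is certified" rule can be sketched in a strongly simplified form. Real Trusted Computing relies on digital signatures and hardware support; the example below replaces both with a plain set of SHA-256 fingerprints, and the names fingerprint, may_execute and the sample programs are assumptions for illustration only.

```python
import hashlib

def fingerprint(program: bytes) -> str:
    """Fingerprint of a program image (simplified stand-in for a signature)."""
    return hashlib.sha256(program).hexdigest()

def may_execute(program: bytes, certified: set) -> bool:
    """Allow execution only if the program's fingerprint was certified."""
    return fingerprint(program) in certified

# Hypothetical certified application and an uncertified worm.
word_processor = b"certified application image"
worm = b"self-replicating code"
certified = {fingerprint(word_processor)}
```

Here the certified application may run while the worm is refused, and so is any modified copy of the certified application, since its fingerprint changes.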
While this sounds good, Palladium [AC02] and TC raise huge privacy concerns.
In these secured environments the tracking of office documents is possible,
private copying of CD-ROMs (which is probably allowed) is no longer possible,
and most likely small software vendors and open source programmers will not be
able to raise the money needed for certification and would be locked out
forever.
2 Spam: History, Trends and Filtering
The original term for mass mailings that the receiver did not request was
”Unsolicited Bulk Email”, or ”UBE” for short. Later on, the word
”Bulk” (used for classifying mass mailings) in this term was replaced by
”Commercial”, as most of these mails were and still are used to advertise goods.
Nowadays the term ”spam” is used for these mails, as it is more memorable
and abbreviations are normally not as handy. The name ”spam” is commonly
explained as ”spiced pork and ham” and describes canned meat which was very
popular during World War II, when it became a primary source of food for the
U.S. troops as well as the residents of the United Kingdom. After years of
eating spam, as it was not rationed, people were fed up with it, and the famous
comedians of Monty Python made a sketch about spam in which the word is repeated
dozens of times and is always present to annoy.
This property of being annoying in the sketch started the usage of ”spam”
for describing unwanted mailings, as these are just as annoying and present
everywhere.
Estimates of the number of spam mails sent vary, but a value of 25 million per
day sounds fairly reasonable. Other sources claim that about 32% of all mails
sent are spam [Stu04], costing $874 per person in lost productivity [Inc03].
The amount of spam an email user receives depends on many factors, such as
their participation on the net (especially in newsgroups) and how they handle
their address. As most sites nowadays require registration, which usually
includes the user’s email address, users who register at many sites will get
more spam because personal data is sold.
So the share of spam can vary between 10% and 90% (as in the inbox of this
author).
2.1 Viruses and Spam
Although one would not normally think so, there are strong connections nowadays
between the writers of worms and viruses and spammers. To explain this we have
to take a short glance at the way spam works.
Spammers usually do not send any mail under their own address because of the
replies they would get from angry users. So they use fake addresses
which are chosen at random and often do not even exist. This is possible
because of design flaws in the protocol used to send e-mails on the Internet
(SMTP), which stem from the early times when the Internet was a network of
trust.
The basic procedure for sending mail is: connect to a mail server, enter the
sender address, enter the recipient, and transmit the mail itself. As we
can see, there is no authentication or checking involved, and a malicious sender
is able to simply use some nonexistent or foreign address.
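The procedure corresponds to a handful of plain-text SMTP commands. The sketch below only builds the list of commands a client would send and opens no network connection; the function name smtp_dialogue, the host name client.example and the addresses are assumptions for illustration.

```python
def smtp_dialogue(sender: str, recipient: str, body: str) -> list:
    """Return the client commands of a minimal SMTP session."""
    return [
        "HELO client.example",        # client introduces itself
        f"MAIL FROM:<{sender}>",      # claimed sender, taken as given
        f"RCPT TO:<{recipient}>",     # recipient of the mail
        "DATA",                       # announces the message text
        body + "\r\n.",               # text, ended by a line with a lone dot
        "QUIT",
    ]

commands = smtp_dialogue("anyone@nowhere.example",
                         "user@site.example",
                         "Subject: hello\r\n\r\nText of the mail")
```

Nothing in this exchange proves that the address given after MAIL FROM belongs to the client, which is the design flaw described above.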
Of course it would be problematic for spammers to use their own Internet
provider’s server for this purpose, so they hunt for so-called ”open relays” on
the net. Open relays are misconfigured servers which accept mail for any domain
and deliver it, instead of accepting mail only for the domains they are
responsible for. Finding new open relay servers is rather hard, and open servers
often get a new, secure configuration within days because the huge amount of
mail is detected.
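The difference between a correctly configured server and an open relay comes down to one decision per recipient. The sketch below is a minimal model of that decision, not the configuration of any real mail server; the function name accepts_mail and the domain names are assumptions.

```python
def accepts_mail(recipient: str, local_domains: set,
                 open_relay: bool = False) -> bool:
    """Decide whether the server takes a mail for this recipient."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    # A secure server only accepts mail for its own domains;
    # an open relay accepts and forwards mail for any domain.
    return open_relay or domain in local_domains

LOCAL = {"uni.example"}  # hypothetical domains the server is responsible for
```

With the secure setting, mail for foreign domains is refused; flipping the single open_relay switch makes the server forward anything, which is what spammers look for.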
So spammers need another source of mail relays on the net. The easiest way to
get one is to write specialized worms which open a backdoor on infected machines
that can then be used to send spam.
On the one hand there are worms which start an ordinary SMTP server
without the user of the computer knowing. A spammer then just has to connect
to this SMTP service and can start sending mails. But there is a drawback:
ordinary users normally have a rather slow connection compared to real mail
servers, as they may use an old modem or ISDN for connecting. So a spammer
who uses these backdoors has to use several at once, for example 20 or more
infected machines in parallel, to get the desired throughput, which
is usually above 500,000 mails per hour, as this is the speed conventional mass
mailing software offers [Dyn04].
Finding these widely distributed infected machines can be a pain, as they are
spread all over the world and some of them are probably of little use because of
slow speed or high packet loss. So the most sophisticated spam-related viruses
use a much more modern technique.
Modern spam worms establish a complete peer-to-peer network, using modern
algorithms from serious P2P research. They add an encryption layer on top,
and these large networks can then be utilized for sending out massive spam
waves.
There are custom tools for scanning the Internet for infected machines and then
hooking into the established network, so only one machine has to be detected
and the full power of thousands of machines is open to the mass mailer. To
utilize these, the spammer can choose how many of the nodes should
be used for the task (the spammer is limited by their own upload bandwidth
anyway) and then assign the individual mail deliveries based on the detected
bandwidth of each infected machine.
Most recently a kind of personal feud arose between the writers of
the Bagle, Netsky and MyDoom viruses. This is not only because Bagle and
MyDoom are written to exploit the same bugs in Windows operating systems, but
rather because the author of Netsky (who is claimed to be a hobbyist rather
than a member of a spam group like the others) implemented his worm to delete
existing instances of Bagle and MyDoom on infected machines. So the P2P
networks of Bagle and MyDoom experience a high fluctuation rate and become less
useful, as most of the worms’ resources have to be used to battle the other
worms instead of finding new victims. This personal feud can be clearly
observed in excerpts of the worm code in which the authors insult each other
[NS04].
Information about the exact techniques used to build and secure these P2P
networks, as well as about their actual usage and the ”cost” of the fluctuation
on the networks, is of course hardly available. It is important for the spam
groups and virus writers to keep this information secret. This is not only
because anti-virus companies would be able to detect and clean infected machines
better, but also to protect the networks from being used by rival groups.
2.2 Spam filtering
There have been many attempts at filtering spam to restore a good overview
of personal inboxes. In the early days of spam filters the technique of good/bad
word lists was very common. For this, extensive lists were kept of words that
are unique to spam (bad list) and words that never occur in spam (good list).
A program looked through every received mail and classified it as spam if a
certain number of bad words was detected.
Later on, when the senders of mass mailings adapted, this method started to
fail.
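The word-list technique and its weakness can be sketched in a few lines. The word lists, the threshold and the rule that a single good word clears a mail are assumptions made for this example; the filters of that time differed in such details.

```python
import re

BAD_WORDS = {"viagra", "casino", "winner"}   # hypothetical bad list
GOOD_WORDS = {"seminar", "paderborn"}        # hypothetical good list
THRESHOLD = 2                                # bad words needed for "spam"

def is_spam(mail: str) -> bool:
    words = re.findall(r"[a-z]+", mail.lower())
    # Assumption: one word from the good list marks the mail as wanted.
    if any(word in GOOD_WORDS for word in words):
        return False
    bad_hits = sum(word in BAD_WORDS for word in words)
    return bad_hits >= THRESHOLD
```

A mail containing "winner" and "viagra" is caught, but the same mail written as "w1nner" and "v1agra" passes, which shows how easily senders could adapt to fixed lists.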
This is where the most recent technology of spam filters comes into play. The
most popular is the Bayesian filter, which is based on probabilities and
statistical data and is even able to learn the user’s habits.
Bayesian filters
Bayes filters (also called Bayesian filters) use knowledge from statistics and
probability theory to estimate whether an email contains unwanted content. These
filters are the most-used type of anti-spam utility at the moment and already
provide very high recognition rates with a low number of false positives.
Bayesian filters are named after Thomas Bayes (1702-1761), an English
mathematician who did research in the area of probability, developing a theory
on the inference of probabilities. The best-known formula which originated from
him is
P(A|B) P(B) = P(A,B) = P(B|A) P(A)
which describes the relationship between the probabilities of two events, where
P(A|B) is the probability of event A given that event B has already happened
and P(A,B) is the probability of both events having happened.
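Solved for P(A|B), the formula gives P(A|B) = P(B|A) P(A) / P(B). The sketch below applies this with A = "mail is spam" and B = "mail contains a certain word". The share of spam is the 32% quoted above [Stu04]; the two word probabilities are hypothetical values chosen only for this example.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_spam = 0.32              # P(A): share of spam among all mails [Stu04]
p_word_given_spam = 0.20   # P(B|A): hypothetical
p_word_given_ham = 0.01    # hypothetical share of wanted mails with the word

# P(B): the word appears either in a spam mail or in a wanted mail.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
```

With these numbers a mail containing the word is spam with a probability of roughly 0.90, although only a third of all mails are spam.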
The basic idea of a Bayesian filter is that a mail is not filtered because of
some words on a blacklist; instead, the filter calculates the probability that
a given mail is spam. For this purpose a Bayesian filter first has to learn the
habits of its user by tracking outgoing mail and the user’s decisions on ”Spam
or no Spam” for incoming mail. With sufficient input the Bayes filter then
starts to calculate probabilities for word combinations (a mail containing
”buy” and ”viagra” will most likely be spam). Some words classify a mail
almost certainly as spam [Gra02]. Although naive Bayesian filters are quite
easy to implement, more modern approaches involve more math to
maximize the recognition rate [Wei03].
By this means the filter adapts to the user’s habits and mail dialogues in a
way that is generally called ”context sensitive”. Thus a urologist who
uses ”viagra” in non-spam mails will get an adaptive filter which does not
touch important mail.
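A naive Bayesian filter of this kind fits into a few lines. The following is a minimal sketch, not the method of [Gra02] or [Wei03]: it counts in how many spam and ham mails each word occurred, scores only the words present in a mail, and uses add-one smoothing. The class and method names and the training mails are assumptions for this example.

```python
import math
import re
from collections import Counter

class NaiveBayesFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per mail.
        return set(re.findall(r"[a-z]+", text.lower()))

    def train(self, text, label):
        """Learn from one mail the user has labelled 'spam' or 'ham'."""
        self.mail_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        total = sum(self.mail_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            n = self.mail_counts[label]
            score = math.log((n + 1) / (total + 2))      # smoothed prior
            for word in self.tokens(text):
                # Add-one smoothing: unseen words never yield probability 0.
                seen = self.word_counts[label][word]
                score += math.log((seen + 1) / (n + 2))
            scores[label] = score
        # Turn the two log scores back into P(spam | words).
        return 1 / (1 + math.exp(scores["ham"] - scores["spam"]))

mail_filter = NaiveBayesFilter()
mail_filter.train("buy cheap viagra now", "spam")
mail_filter.train("buy viagra online", "spam")
mail_filter.train("meeting tomorrow at noon", "ham")
mail_filter.train("lecture notes attached", "ham")
```

Because the counts come from the user's own mail, a urologist who labels mails containing "viagra" as ham shifts the probabilities accordingly, which is the context sensitivity described above.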
There are many ways to improve the described naive Bayesian filters. For
example, an advanced filter could recognize conjugated verbs as having the same
origin. More modern filters also have to understand Hypertext Markup Language
(HTML), as most spam mails today arrive in HTML form. To cloak words, spammers
interrupt them with (undefined) HTML tags. As a browser ignores these tags, the
word looks fine in the mail reader, while a more naive filter
will not recognize it. In addition, so-called ”magnets” can be defined: a mail
is immediately sorted into either spam or ham (wanted mail) if a magnet word
is encountered. Magnets can be very important for sorting press
releases into the wanted mail folder [Lin].
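The cloaking trick and its countermeasure can be illustrated as follows. The sketch removes everything that looks like a tag with a simple regular expression before the words are examined; this is a deliberately crude assumption, as a production filter would use a real HTML parser. The function name visible_text and the sample mail are made up for this example.

```python
import re

TAG = re.compile(r"<[^>]*>")   # anything between '<' and the next '>'

def visible_text(html: str) -> str:
    """Approximate what the mail reader displays by dropping all tags."""
    return TAG.sub("", html)

cloaked = "Buy vi<x>ag</x>ra <b>now</b>"
shown = visible_text(cloaked)
```

The raw mail never contains the word as one piece, so a filter working on the raw text misses it, whereas the text with tags removed reads like the displayed mail.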
For the basic learning strategy of filters there are three categories:
Self-strengthening:
In this way of learning the filter modifies the word probabilities based on
its own classification. Thus it always confirms its own decisions and
can adapt more easily to new, varying spam.
On-demand:
On the more user-oriented side, the automatic learning of the filter is
turned off. Usually the user, on noticing that the classification rate
decreases, feeds more recent spam and ham messages into the filter to adjust the
probabilities.
Difference learning:
The difference-learning approach uses two folders into which the user sorts
wrongly classified spam and approved ham. The learning of the filter
then relies solely upon these two folders.
The most advanced spam currently cannot be filtered by Bayesian filters, as they
rely on the text of a mail and do not check images. There are already mailings
which show the spam content in an image while the text consists only of
typical ham words, so these mails pass the filters untouched
[Lew04].
References
[AC02] Amy Carroll, Mario Juarez, et al. Microsoft Palladium: A business overview. 2002.
[Bac] Daniel Bachfeld. Surf-Versicherung. cT Magazin fuer Computertechnik, page 105.
[Bus04] Peter Busser. Linux Magazine, 2004.
[Dyn04] DynamicSoftware. FAQ for Mail Communicator. 2004.
[Gra02] Paul Graham. Better Bayesian filtering. 2002.
[Inc03] Nucleus Research Inc. Spam: The silent ROI killer. 2003.
[Jar] JargonFile. The Jargon File, a comprehensive compendium of hacker slang.
[Kru03] Karl S. Kruszelnicki. Great moments in science - software sucks. 2003.
[Lew04] David D. Lewis. (Naive) Bayesian text classification for spam filtering. 2004.
[Lia04] Zhenkai Liang. Defensing stack smashing attack. 2004.
[Lin] Andreas Linke. Spam oder nicht Spam. cT Magazin fuer Computertechnik, pages 150–153.
[Mic00] Sun Microsystems. The Java Language Specification, second edition. 2000.
[NS04] Net-Security.org. The creators of Bagle, MyDoom and Netsky exchange pleasantries. 2004.
[Ric02] Gerardo Richarte. Four different tricks to bypass StackShield and StackGuard protection. 2002.
[Stu04] IDC Studies. The true cost of spam and value of anti-spam solutions. 2004.
[Wae03] Graeme Waerden. EU to lose billions through spam and viruses. 2003.
[Wei03] Kai Wei. A naive Bayes spam filter. 2003.