08. Worms, Viruses and Spam

Thomas-Ivo Heinen
[email protected]
Immatriculation Nr.: 6055050
University of Paderborn
Organizer: Christian Schindelhauer
August 4th, 2004

Abstract

This document addresses two of the biggest problems in IT today. The first is the growing plague of viruses and worms, which is estimated to cost the European economy about 9 billion euro in 2004. As the battle between virus writers and anti-virus companies gets harder, the means used on both sides become more technologically advanced. The first part of this paper covers the basic mechanism behind worms and viruses, called "exploiting". Nowadays, worms and viruses install backdoors on infected machines and form huge remote-controllable networks which are then used for massive spamming. According to [Wae03], the loss of working time and the load on infrastructure in Europe will amount to about 9.2 billion euro this year. The second part of this document describes the connections between worm writers and spammers and their basic approach to cooperation, as well as the theory behind the spam filters currently in use.

1 Exploits and their prevention

So-called exploits are gates which open up systems to virus attacks. Essentially, they are programming glitches which can be used to introduce malicious code. This section details the term "exploit" and describes the countermeasures that currently exist.

1.1 Exploits

According to [Jar], an exploit is defined as follows: "exploit n. [originally cracker slang] 1. A vulnerability in software that can be used for breaking security or otherwise attacking an Internet host over the network."

In particular, this concerns programming glitches where some circumstances were not considered. For example, suppose a C/C++ program defines an array of 200 elements. If an attempt is made to write to an element beyond the end of the array, an exploitable situation can occur.
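The unchecked array write can be sketched with a small simulation. The following Python model is only an illustration under stated assumptions: Python itself checks bounds, so a flat `bytearray` stands in for process memory, and the 8-byte marker placed directly behind the 200-element array is a hypothetical stand-in for data the program relies on (for example a saved return address). The names `make_memory`, `unchecked_write` and `checked_write` are made up for this sketch.

```python
ARRAY_LEN = 200  # the 200-element array from the example


def make_memory():
    """Return simulated memory: the array followed by neighbouring data.

    The trailing marker is an assumption of this sketch; it stands in for
    whatever the program stores right behind the array.
    """
    return bytearray(ARRAY_LEN) + bytearray(b"RETADDR!")


def unchecked_write(memory, index, value):
    """Write using a 1-based user index without validating it (the bug)."""
    memory[index - 1] = value


def checked_write(memory, index, value):
    """Write only if the 1-based user index lies inside the array."""
    if not 1 <= index <= ARRAY_LEN:
        raise ValueError("index must be between 1 and %d" % ARRAY_LEN)
    memory[index - 1] = value


memory = make_memory()
unchecked_write(memory, 201, 0x41)   # one element past the end of the array
print(bytes(memory[ARRAY_LEN:]))     # the neighbouring data has been modified
```

Writing element 201 through `unchecked_write` silently alters the first byte of the neighbouring data, while `checked_write` rejects the same input with an error.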
If, for instance, some input handling asks the user to enter a number from 1 to 200 in order to change an element of the array, and the program does not check whether the entered number is in bounds, the input can be used to write to memory at positions where the program does not expect it.

The exploits that virus and worm writers deal with are mostly bugs in programs like Microsoft Outlook or in the operating system itself. As these consist of millions of lines of code, there are plenty of opportunities to find an exploitable bug. The attack essentially consists of injecting executable code into the buggy program; this code is then executed and makes it possible to copy the actual malicious program (such as a worm) onto the system. It is therefore crucial to avoid these exploitable bugs, and there are several means to do so.

1.2 Software measures

Writing error-free code would rule out all exploits, but for every thousand lines of code there are an estimated 100 to 150 bugs. Even with extreme caution, up to five undiscovered errors remain in this amount of code [Kru03]. Several kinds of software help to minimize the probability of a program being exploited.

Quality assurance and lint tools

From the start, a programmer can keep the bug rate low by making it a habit to check all external information, such as user input or network communication. But there may still be situations where such input is used without checks, or where an exact examination is not possible.

For these cases there are various quality assurance tools, mostly called "lint" tools. Their original purpose was to detect style glitches, but one implementation called "splint" (http://www.splint.org/) specializes in recognizing dangerous parts of software written in the C programming language.

Tracing security problems in software is a heavy task, and as the complexity of the programming language increases, detecting problems becomes harder.
This is why few such tools exist for more advanced languages. Commercial CASE tools sometimes provide support for this, which is then called "auditing". One example of such a tool is Borland Together Control Center.

Compiler add-ons

Even if a program is already written, there are still ways of securing it. Tools like StackGuard and StackGhost have been available for a long time; they hook into the build or execution of a program and make dangerous code parts less likely to be exploitable. One hardening technique in this area is randomizing the address space (known as Address Space Layout Randomization or ASLR). When the positions of code and data in memory are randomized, it becomes much harder to guess which exact steps have to be taken to exploit a program. In addition, on every call of a subprogram a cookie (also called a canary) is placed onto the stack [Lia04]. When the subprogram returns, a check determines whether the cookie is still intact or was overwritten by some exploitation technique. If it was overwritten, the whole application is terminated for security reasons, which prevents the execution of probably malicious code. Unfortunately, there are still ways to circumvent this kind of protection [Ric02].

Virtual machines

Most exploits can be avoided by using a programming language that runs on a virtual machine. A virtual machine is an abstraction of the actual computer, either completely simulating a computer or separating the executed binary code from the code the machine understands natively. In addition, virtual machine-based languages automatically check array bounds, which removes this class of exploitable bugs (see bounds checking in Java [Mic00] and .NET).

Operating system support

The most recent versions of operating systems support some form of protection against the injection of foreign code into applications. For example, Windows XP Service Pack 2 does this in two ways [Bac].
On the one hand, it supports the DEP/NX features of 64-bit processors (see below); on the other hand, the Service Pack itself was compiled with a special configuration of Microsoft's own C/C++ compiler which uses so-called "cookies" to detect possible exploitation (see Compiler add-ons above). The problem with this feature is that it only protects the core functions of Windows against exploitation. Software from other vendors does not benefit unless it is compiled with the cookie feature, which is only present in the most recent versions of Microsoft's compiler and in special gcc versions (StackGuard, SSP, ProPolice).

The same is true of the Linux PaX extension [Bus04], a patch that has to be applied to the main Linux kernel in order to secure it. But because of the more open infrastructure (gcc is used almost everywhere and is free of charge), it is very likely to be adopted throughout the whole community within a fairly short time. PaX support is said to be merged within the 2.6 line of the Linux kernel.

1.3 Hardware measures

Data Execution Prevention (DEP/NX/EVP)

Data Execution Prevention describes a technique that protects programs from being exploited by marking memory regions as "non-executable". Due to the architecture of this protection, it is only usable on computers with 64-bit wide page table entries. The most current CPUs that allow DEP are the Intel Itanium and the AMD Athlon64, both being true 64-bit processors. DEP works by using a reserved bit in the page table entry to mark memory areas as non-executable. As this bit is the highest bit of the 64-bit page table entry, DEP is only usable in 64-bit mode. For 32-bit mode there is currently only an implementation for servers which have the Physical Address Extension (PAE) enabled. On 32-bit platforms, only the upcoming AMD Sempron processor will be able to use DEP techniques, under the name Enhanced Virus Protection (EVP), which is somewhat misleading.
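The role of the highest page table bit can be illustrated with a few lines of Python. This is a simplified model, not an operating system interface: a page table entry is treated as a plain 64-bit integer, bit 63 is taken as the no-execute bit as described above, and all function names are hypothetical.

```python
NX_BIT = 1 << 63            # highest bit of a 64-bit page table entry
ENTRY_MASK = (1 << 64) - 1  # keep values within 64 bits


def mark_non_executable(entry):
    """Return the entry with the no-execute bit set."""
    return (entry | NX_BIT) & ENTRY_MASK


def is_executable(entry):
    """A page may be executed only while the no-execute bit is clear."""
    return (entry & NX_BIT) == 0


def fetch_instruction(entry):
    """Model of the check made before code on a page is run."""
    if not is_executable(entry):
        raise PermissionError("page is marked non-executable")
    return "executing"


# Hypothetical entries: some address bits plus low flag bits.
code_page = 0x1000 | 0x1
stack_page = mark_non_executable(0x2000 | 0x3)
```

A 32-bit entry has no bit 63, which is the reason given above for why this protection depends on 64-bit wide page table entries.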
As we have seen, DEP does not provide any anti-virus functionality but only limits the possibilities for using exploits. This support on a 32-bit platform is possible because the Sempron processors are based on the Athlon64 and were merely stripped of the 64-bit mode while keeping support for 64-bit page tables.

Palladium and Trusted Computing

Another method of preventing the execution of viruses and worms is still in preparation. The so-called Trusted Computing (TC) initiative is a group of big hardware vendors which tries to address the issues of Digital Rights Management (DRM), copy protection, confidentiality and security. On a TC-enabled computer the operating system only executes verified programs and denies execution of unsigned ones. Thus only applications which were certified by, e.g., Microsoft will be able to install and start up on a computer. As no worm or virus will ever obtain a valid signature, their execution in a TC environment will be impossible.

While this sounds good, Palladium [AC02] and TC raise huge privacy concerns. In these secured environments, the tracking of office documents becomes possible, private copying of CD-ROMs (which may well be legal) is no longer possible, and small software vendors and open-source programmers will most likely not be able to raise the money needed for certification and would be locked out permanently.

2 Spam: History, Trends and Filtering

The original term for mass mailings that the receiver did not request was "Unsolicited Bulk Email", or "UBE" for short. Later on, the word "Bulk" (used for classifying mass mailings) was replaced by "Commercial", as most of these mails were and still are used to advertise goods. Nowadays the term "spam" is used for these mails, as it is more memorable and abbreviations are usually not as handy.

The original meaning of spam is "spiced pork and ham"; it describes canned meat which was very popular during World War II, when it became the primary source of food for the U.S. troops as well as the residents of the United Kingdom. As spam was not rationed, people were fed up with it after years of eating it, and the famous comedians of Monty Python made a sketch about spam in which the word is repeated again and again and is always present to annoy. This annoying omnipresence in the sketch started the use of "spam" for unwanted mailings, as these were just as annoying and present everywhere.

Estimates of the number of spam mails sent vary, but a value of 25 million per day sounds fairly reasonable. Other sources claim that about 32% of all mails sent are spam [Stu04], costing $874 per person in lost productivity [Inc03]. The amount of spam an email user receives depends on many factors, such as his participation on the net (especially in newsgroups) and his own handling of his address. As most sites nowadays require registration, which usually includes the user's email address, users who register at many sites will get more spam because their personal data is sold. The share of spam in an inbox can therefore vary between 10% and 90% (as in the inbox of this author).

2.1 Viruses and Spam

Although one would not normally expect it, there are strong connections nowadays between the writers of worms and viruses and the senders of spam. To explain this, we have to take a short glance at the way spam works.

Spammers usually do not send mail under their own address because of the replies they would get from angry users. Instead they use fake addresses which are chosen at random and in fact do not even exist. This is possible because of design flaws in the protocol used to send e-mail on the Internet (SMTP), which stem from the early days when the Internet was a network of trust. The basic procedure for sending mail is: connect to a mail server, enter the recipient, enter additional information (the sender's email address) and the mail itself. As we can see, no authentication or checking is involved, and a malicious sender can simply use a nonexistent or foreign address.
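The missing check can be made concrete with a toy model of the receiving side. The sketch below is a deliberately simplified assumption of how a server reads the SMTP commands `MAIL FROM`, `RCPT TO` and `DATA`; it is not a real mail server, the function name `accept_mail` is hypothetical, and the addresses use reserved example domains. The point is only that the sender address is taken from the client as-is.

```python
def accept_mail(command_lines):
    """Collect the envelope from a list of client command lines.

    Nothing here verifies that the sender address exists or belongs to
    the connecting client; that absence is what the sketch illustrates.
    """
    envelope = {"sender": None, "recipients": [], "body": []}
    in_data = False
    for line in command_lines:
        if in_data:
            if line == ".":          # a single dot ends the message text
                in_data = False
            else:
                envelope["body"].append(line)
        elif line.upper().startswith("MAIL FROM:"):
            address = line[len("MAIL FROM:"):].strip()
            envelope["sender"] = address.strip("<>")
        elif line.upper().startswith("RCPT TO:"):
            address = line[len("RCPT TO:"):].strip()
            envelope["recipients"].append(address.strip("<>"))
        elif line.upper() == "DATA":
            in_data = True
    return envelope


session = [
    "HELO client.example",
    "MAIL FROM:<nobody@invalid.example>",   # arbitrary, unverified sender
    "RCPT TO:<user@example.org>",
    "DATA",
    "Subject: hello",
    "",
    "message text",
    ".",
    "QUIT",
]
envelope = accept_mail(session)
```

In this model the server records whatever sender the client claims, which mirrors the design flaw described in the text.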
Of course, it would be problematic for spammers to use their own Internet provider's server for this purpose, so they hunt for so-called "open relays" on the net. Open relays are misconfigured servers which accept mail for any domain and deliver it, instead of accepting mail only for the domains they are responsible for. Finding new open relay servers is rather hard, and open servers often get a new, secure configuration within days because the huge amount of mail is noticed. So spammers need another source of mail relays on the net.

The easiest way to obtain them is to write specialized worms which open a backdoor on infected machines that can then be used to send spam. On the one hand, there are worms which start an ordinary SMTP server without the user of the computer knowing. A spammer then just has to connect to this SMTP service and can start sending mail. But there is a drawback: ordinary users normally have a rather slow connection compared to real mail servers, as they may use an old modem or ISDN line. So a spammer who relies on these backdoor worms has to use several infected machines at once, for example 20 or more in parallel, to reach the desired throughput, which is usually above 500,000 mails per hour, as this is the speed conventional mass mailing software offers [Dyn04]. Finding these infected machines can be a pain, as they are spread all over the world and some of them are of little use because of slow speed or high packet loss.

So the most sophisticated spam-related worms use a much more modern technique. They establish a complete peer-to-peer (P2P) network, using current algorithms from serious P2P research. They add an encryption layer on top, and these large networks can then be utilized for sending out massive spam waves.
There are custom tools for scanning the Internet for infected machines and then hooking into the established network, so only one machine has to be found and the full power of thousands of machines is open to the mass mailer. The spammer can choose how many of the nodes should be used for his task (he is limited by his own upload bandwidth anyway) and then simply assigns the individual mail deliveries based on the detected bandwidth of each infected machine.

Most recently, a kind of personal feud broke out between the writers of the Bagle, Netsky and MyDoom worms. This is not only because Bagle and MyDoom are written to exploit the same bugs in Windows operating systems, but above all because the author of Netsky (who is claimed to be a hobbyist rather than a member of a spam group like the others) implemented his worm to delete existing instances of Bagle and MyDoom on infected machines. As a result, the P2P networks of Bagle and MyDoom experience a high fluctuation rate and become less useful, as most of the worms' resources have to be spent on battling the other worms instead of finding new victims. This personal feud can be clearly observed in excerpts of the worm code in which the authors harass each other [NS04].

Little information is available about the exact techniques used to build and secure these P2P networks, or about their actual usage and the "cost" of the fluctuation in the networks. It is important for the spam groups and virus writers to keep this information secret, not only because anti-virus companies would otherwise be able to detect and clean infected machines more effectively, but also to protect the networks from being used by rival groups.

2.2 Spam filtering

There have been many attempts at filtering spam in order to restore a clear overview of personal inboxes. In the early days of spam filters, the technique of good/bad word lists was very common.
For this technique, extensive lists were kept of words that are typical of spam (bad list) and words that never occur in spam (good list). For every received mail, a program looked through the text and classified the mail as spam if a certain number of bad words was detected. Later on, when the senders of mass mailings adapted, this method started to fail. This is where the most recent generation of spam filters comes into play. The most popular is the Bayesian filter, which is based on probabilities and statistical data and is even able to learn the user's habits.

Bayesian filters

Bayes filters (also called Bayesian filters) use knowledge from statistics and probability theory to estimate whether an email contains unwanted content. These filters are the most widely used type of anti-spam utility at the moment and already provide very high recognition rates with a low number of false positives.

Bayesian filters are named after Thomas Bayes (1702-1761), an English mathematician who did research in the area of probability and developed a theory on the inference of probabilities. The best-known formula which originated from him is

P(A|B) P(B) = P(A,B) = P(B|A) P(A)

which describes the relationship between the probabilities of two events, where P(A|B) is the probability of event A given that event B has already happened and P(A,B) is the probability that both events have happened.

The basic idea of a Bayesian filter is that a mail is not filtered because of some words on a blacklist; instead, the filter calculates the probability that a given mail is spam. For this purpose a Bayesian filter first has to learn about the habits of its user by tracking outgoing mail and the user's "spam or no spam" decisions on incoming mail. With sufficient input the Bayes filter then starts to calculate probabilities for word combinations (a mail containing "buy" and "viagra" will most likely be spam). Some words classify a mail almost certainly as spam [Gra02].
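How such probabilities can be computed is sketched below. This is a minimal naive Bayes classifier written for illustration, not the filter of any particular product: the class name `NaiveBayesFilter`, the tokenizer, the add-one smoothing and the tiny training set are all assumptions of this sketch. It applies the formula above with A = "mail is spam" and B = "mail contains these words", treating the words as independent of each other.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a mail text into lower-case word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one mail the user has marked as 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.mail_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words); needs at least one mail per class."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior P(label), then one factor P(word | label) per known word.
            score = math.log(self.mail_counts[label] / total_mails)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocabulary:
                    continue  # words never seen in training carry no evidence
                # Add-one smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            log_score[label] = score
        # Bayes' rule in log space; the difference is clamped for exp().
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


# Hypothetical training data standing in for the user's decisions.
spam_filter = NaiveBayesFilter()
spam_filter.train("buy viagra now", "spam")
spam_filter.train("cheap viagra buy today", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("please review the project report", "ham")
```

Working with logarithms keeps the product of many small probabilities numerically stable. With this training set, a mail containing "buy" and "viagra" receives a high spam probability, while a mail about a project meeting does not.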
Although naive Bayesian filters are quite easy to implement, more modern approaches involve more mathematics in order to maximize the recognition rate [Wei03]. By learning in this way, the filter adapts to the user's habits and mail dialogues in a manner that is generally called "context sensitive". Thus a urologist who uses "viagra" in non-spam mails will get an adaptive filter which leaves important mail untouched.

There are many ways to improve the naive Bayesian filters described so far. For example, an advanced filter could recognize conjugated forms of a verb as belonging to the same word. More modern filters also have to understand Hypertext Markup Language (HTML), as most spam mails today arrive in HTML form. To cloak words, spammers interrupt them with (undefined) HTML tags. As the mail reader ignores these tags when rendering, the word looks normal to the recipient, while a more naive filter will not recognize it. In addition, so-called "magnets" can be defined: if a magnet word is encountered, the mail is immediately sorted into either spam or ham (wanted mail). Such a magnet can be very useful, for example to sort press releases into the wanted mail folder [Lin].

For the basic learning strategy of filters there are three categories:

Self-strengthening: In this way of learning, the filter modifies the word probabilities based on its own classifications. Thus it is always confirming its own decisions and can adapt more easily to new, varying spam.

On-demand: On the more user-oriented side, the automatic learning of the filter is turned off. Usually the user, on noticing that the classification rate decreases, feeds more recent spam and ham messages into the filter to adjust the probabilities.

Difference learning: The difference-learning approach uses two folders into which the user sorts wrongly classified spam and approved ham. The learning of the filter then relies solely on these two folders.
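The HTML cloaking trick described above can be countered by removing markup before the words are counted. The following sketch uses the `HTMLParser` class from Python's standard library; the helper name `strip_tags` and the sample mail text are assumptions made for illustration, and a real filter would need to handle many more disguises.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text as a mail reader would display it, without markup."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    # Joining without separators glues a word split by tags back together.
    return "".join(parser.parts)


cloaked = "Buy vi<xyz>ag</xyz>ra now"   # hypothetical cloaked mail text
visible = strip_tags(cloaked)
```

The raw text does not contain the word "viagra", but the stripped text does, so a filter that tokenizes `visible` instead of `cloaked` sees the same words as the recipient.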
The most advanced spam currently cannot be filtered by Bayesian filters, as they rely on the text of a mail and do not check images. There are already mailings which show the spam content in an image while the text consists only of typical ham words, so these mails pass the filters untouched [Lew04].

References

[AC02] Amy Carroll, Mario Juarez, et al. Microsoft Palladium: A business overview. 2002.
[Bac] Daniel Bachfeld. Surf-Versicherung. c't Magazin für Computertechnik, page 105.
[Bus04] Peter Busser. Linux Magazine, 2004.
[Dyn04] DynamicSoftware. FAQ for Mail Communicator. 2004.
[Gra02] Paul Graham. Better Bayesian filtering. 2002.
[Inc03] Nucleus Research Inc. Spam: The silent ROI killer. 2003.
[Jar] JargonFile. The Jargon File, a comprehensive compendium of hacker slang.
[Kru03] Karl S. Kruszelnicki. Great moments in science - Software sucks. 2003.
[Lew04] David D. Lewis. (Naive) Bayesian text classification for spam filtering. 2004.
[Lia04] Zhenkai Liang. Defensing stack smashing attack. 2004.
[Lin] Andreas Linke. Spam oder nicht Spam. c't Magazin für Computertechnik, pages 150-153.
[Mic00] Sun Microsystems. The Java Language Specification, Second Edition. 2000.
[NS04] Net-Security.org. The creators of Bagle, MyDoom and Netsky exchange pleasantries. 2004.
[Ric02] Gerardo Richarte. Four different tricks to bypass StackShield and StackGuard protection. 2002.
[Stu04] IDC Studies. The true cost of spam and value of anti-spam solutions. 2004.
[Wae03] Graeme Waerden. EU to lose billions through spam and viruses. 2003.
[Wei03] Kai Wei. A naive Bayes spam filter. 2003.