Impact of Search Engine Editorial Policies on
Small Business and Freedom of Information
Ted Goldsmith
Search Engine Honesty
Azinet LLC
April 2007 (revised July 2007)
Introduction
Search engines have become an essential part of the way we use the Internet to
communicate. Although nearly anyone can now develop and operate a web site, public
access to all those sites is dependent on search engines. In contrast to the proliferation of
web sites[1], market forces act to consolidate search. At present, only three major search
engines (Google, Yahoo Search, and Microsoft Search) provide more than 90 percent of
all U.S. searches[2], and Google's share is more than 60 percent and rapidly increasing.
Continuing consolidation is likely. The same companies are also extremely strong
worldwide and support many languages.
Courts have determined that search engine results pages are publications and that
therefore search engines have editorial free-speech rights similar to those of a newspaper
to edit, bias, censor[3], manipulate, and otherwise alter search engine results[4] in almost any
desired manner. The modern information processing technology used by search engines
can be used to implement any desired editorial policy and all the major search engines
have policies that direct how they process and display search results. As will be described, the current situation has a severe negative effect on small business and can be expected to greatly impact our general freedom of information access.

[1] Web Site: A collection of web pages hosted at a single domain name (e.g. somedomainname.com).

[2] Search Traffic: Nielsen//NetRatings measured searches performed by U.S. home and work web surfers. For April 2007: Google and its partners (AOL Search) using Google data, 60.6 percent of searches; Yahoo, 21.9 percent; MSN Search, 9.0 percent; Ask, 1.8 percent; others (total), 8.5 percent. For July 2006: Google and its partners (AOL Search) using Google data, 55.5 percent of searches; Yahoo, 23.8 percent; MSN Search, 9.6 percent; Ask, 2.6 percent; others (total), 11.1 percent.

[3] Censoring - Semantic Note: The term censoring is normally used to refer to government or other official deletion or suppression of information provided by others. Redaction has a similar connotation. In connection with traditional publications, the exercise of editorial authority to elect not to publish certain information would probably be referred to as "editing." However, in a traditional publication, such election is typically a small part of the editorial process. Reporters do not write stories expecting that there is a significant chance the story will be entirely rejected. In search engines, editorial policies are primarily exerted by restricting or preventing access to information provided by others that would otherwise be available. In this document, "censor", for lack of a better word, means deleting reference to or display of web site information in implementation of an editorial policy. "Suppression" means negatively altering the display of certain information, such as by altering page position. Deletion or suppression of entire web sites is common.

[4] Search Engine Results: Specifically non-sponsored or unpaid results listings (also known as organic results) as opposed to paid or sponsored "results" (advertised web sites).
Purpose of Search Engine Editorial Policies
Search engines all operate through similar mechanisms, including complex software constructs, or algorithms, that govern how they acquire and index data from web pages using robot "spiders" and how they rank different pages in search results. See Search Engine Mechanics for a detailed description of these systems. Implementing these
systems involves judgments that could be considered part of an editorial policy. Most
search users who give it any thought realize that such algorithms and judgments must
exist in order to implement a search engine but assume these algorithms and judgments
are applied equally and fairly to all web sites in an essentially mechanical manner. The
spidering and ranking algorithms indeed contain a great amount of general criteria that
are applied equally to all web sites; however, search engines also exercise editorial
policies against specific individual web sites.
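As a concrete illustration of the mechanics just described, the following minimal sketch builds a tiny inverted index from a toy set of pages and ranks results by a crude merit score. Every page, URL, and value in it is an assumption made for illustration only; it is not any search engine's actual implementation, which crawls billions of pages and applies far more elaborate ranking criteria.

    # Illustrative sketch only: a toy "spider -> index -> rank" pipeline.
    from collections import defaultdict

    # A toy web: URL -> page text. A real spider would fetch these over HTTP.
    PAGES = {
        "http://example-a.test/": "small business directory of local services",
        "http://example-b.test/": "directory of web sites about small business",
        "http://example-c.test/": "recipes and cooking tips",
    }

    # Inverted index: word -> set of URLs containing that word.
    index = defaultdict(set)
    for url, text in PAGES.items():
        for word in text.lower().split():
            index[word].add(url)

    def search(query):
        """Rank pages by a crude merit score: how many query words they contain."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for url in index.get(word, ()):
                scores[url] += 1
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    print(search("small business directory"))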
It has recently become widely known that Google is censoring search results displayed to
their Chinese language users (Google.cn) to conceal the existence of specific individual
sites that the Chinese government considers objectionable. It is much less well known
that all the major search engines also censor English language sites displayed to U.S.
users using similar techniques.
A major need for editorial control involves reduction of web spam[5] and web site deception[6]. People pick a search engine based on the perceived comprehensiveness of
search results (ability to find relevant pages), freshness (how often the search engine
visits pages and updates its index to reflect new information), and quality of results
(probability that a given result page is useful as opposed to “spam”). A poll conducted by
Search Engine Honesty indicates that the last factor is the most important for 60 percent
of users. It is therefore no surprise that search engines are trying hard to improve the
average quality of their results by censoring or suppressing spam sites and sites
employing deceptive techniques.
The Google Annual Report 2006 says under “Risk Factors”: “There is an ongoing and
increasing effort by ‘index spammers’ to develop ways to manipulate our web search
results. For example, people have attempted to link a group of web sites together to
manipulate web search results. … If our efforts to combat these and other forms of index
spamming are unsuccessful, our reputation for delivering relevant information could be diminished. This would result in a decline in user traffic, which would damage our business."

[5] Web Spam: A web site listed in search results that turns out to have little or no useful content. Spam sites irritate search users while consuming search engine resources. Spam is now an epidemic that especially threatens the smaller search engines. Since new and used domain names are inexpensive, there is a continuing flood of new spam sites using steadily improving technology.

[6] Deceptive Web Sites: Sites that use deceptive practices to gain an unfair ranking advantage over competing sites.
All the majors (Google, Yahoo Search, MSN Search) admit on “webmaster guidelines”
pages to censoring access by their users to sites that employ deceptive practices to
unfairly increase their search engine exposure. Because, like the Chinese government, search engines take the position that any site they have deleted "has done something wrong," they prefer the terms "banning" or "penalization" to "censoring." However,
functionally, there is no difference. A site that has been censored (or “banned”,
“penalized”, “blackballed”, “blacklisted”, “de-listed”, or “removed from our index”)
cannot be found no matter how relevant its pages are to a search. Censored sites are
manually, on a site by site basis, removed from and barred from a search engine’s index.
(Search engines can also use site-unique bias[7] to suppress access to individual web sites.)
Search engines see use of deceptive practices, also known as “black hat” search engine
optimization, as a pro-active attack on the search engine’s integrity and function and
therefore employ “punitive” procedures against small businesses using such practices.
None of the majors admit to any censoring or suppression on pages likely to be seen by
their users (people doing searches).
Note carefully that there is a difference between outright censoring, in which all
(sometimes all but one) of a site’s pages are removed from the index, and a “rank”
problem where it is only less likely that a site will be found. It is easy to determine if a
site has been deleted. (See Is Your Site Banned?) It is much harder to detect even gross
bias against an individual site in a search engine’s ranking algorithm or depth-of-crawl
algorithm. Search engines frequently make general changes in their algorithms causing
changes in the ranking of any particular site.
Search engines do delete sites for using deceptive practices but they also de-list sites for
inconvenient practices and may also remove sites for competitive reasons or other
editorial reasons. Generally speaking, small sites (less than 100 pages hosted on a domain
name) do not appear to be currently at risk for being banned except for clearly deceptive
practices. It is also possible for any site to fail to appear on a search engine because of
technical site configuration issues. A very small and insignificant site could conceivably
be missed by a major search engine, especially if it had very few incoming links from
other sites.
Nearly 60 percent of our poll respondents said that search engines should provide the
option of seeing censored results and 90 percent said that search users should at least be
advised that some sites had been intentionally deleted.
[7] Site-Unique Bias: The use of a search engine ranking or depth-of-crawl algorithm that contains or refers to a list of domain names, trademarked names, site-unique phrases, or other site-unique or business-unique information in order to negatively or positively bias search result ranking or page indexing of individual, hand-selected web sites.
Deceptive Practices
Deceptive practices involve features of a web site specifically designed to “trick” search
engines. Such practices are designed to take advantage of weaknesses in a search engine's
system in order to get an unfair advantage in search engine exposure. The following is a
list of common deceptive practices:
- Using invisible text (same color as the background) to feed different text to the search engine spider from that seen by a viewer; using tiny type at the bottom of a page for the same purpose; "stuffing" keywords in "ALT" or "Keywords" tags (usually not seen by viewers); many other similar techniques.
- Programming a web server to detect when it has been accessed by a search engine's spider and feeding the robot different information than would be received by a viewer ("cloaking"). A simple check for this practice is sketched after this list.
- Use of multiple "doorway pages" that are each designed to be optimum for a particular search engine.
- Links in locations or on pages that would seldom or never be seen by human visitors for the purpose of gaming link popularity.
- Gross duplication of data such as hosting the same site on multiple domain names. (See The Redundancy Explosion.)
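As noted in the cloaking item above, one way an outsider might check for user-agent cloaking is to request the same page while identifying as a browser and as a crawler and compare what comes back. The sketch below is an assumption-laden illustration only: the user-agent strings, the example URL, and the crude size comparison are invented here, and real checks would compare actual content and account for IP-based delivery as well.

    # Illustrative sketch: flag possible user-agent "cloaking" by comparing the
    # response served to a browser-like client with the response served to a
    # crawler-like client. Thresholds and user-agent strings are assumptions.
    import urllib.request

    def fetch(url, user_agent):
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()

    def looks_cloaked(url):
        as_browser = fetch(url, "Mozilla/5.0 (compatible; ordinary browser)")
        as_crawler = fetch(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
        larger = max(len(as_browser), len(as_crawler)) or 1
        # A large size difference is only a hint, not proof, of cloaking.
        return abs(len(as_browser) - len(as_crawler)) / larger > 0.5

    print(looks_cloaked("http://www.searchenginehonesty.com/"))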
Deceptive practices are typically aimed at improving the search results rank a site would
have for particular keywords relative to a “legitimate” site on the same subject. This
problem is made more difficult by the fact that search engines are extremely reluctant to
define “legitimate” web site design practices in any detail. “Deceptive” is therefore a gray
area and many small-business sites are enticed by unscrupulous “black hat” search engine
optimization operators into using techniques that may cause temporary rank improvement
but eventually result in penalization.
A second goal may be to increase site traffic generally by using hidden keywords for popular but off-topic subjects. This could be useful if the site is displaying pay-by-impression advertising or advertising a subject of very general interest (e.g. Coca-Cola).
Search engines may be willing to describe the particular deceptive practice causing
censoring if requested by a webmaster. In addition, Google is reported to be testing a
system whereby webmasters of sites censored for a deceptive practice would be advised
by means of email to [email protected] that their site has been censored and the
reason for the action. All the major engines have procedures whereby webmasters who notice that their site has been censored, determine the nature of the deceptive practice, and fix it can apply for reinstatement. (See Webmaster Guidelines.)
Our impression is that sanctions for grossly deceptive practices are more or less fairly
applied. A large-business, Fortune 500 website engaging in clearly deceptive practices
will likely be censored as well as a minor site. A major car manufacturer’s site was
recently briefly censored by Google, apparently for using doorway pages. Reinstatement
(Google's term is “reinclusion”) is likely to be much slower for a small-business site that
has ceased the deceptive practice. Google is widely reported to have “punishment” or
“timeout” periods preceding site reinstatement.
Inconvenient Practices
Inconvenient practices involve features of web sites which, while not deceptive,
nonetheless represent a problem for a search engine. More specifically, the automated,
software-driven processing at the engine does not handle these features in a way that is
satisfactory for the search engine’s management. It is easier to manually delete thousands
of entire sites than fix the problems with the software. Notice that in this case the site
isn’t “doing anything wrong”; the problem is actually at the search engine. If the search
engine design were changed such that another feature became a problem, then sites
having that feature would also be censored. Sites that have been operating for five years
or more have been suddenly banned by search engines. (See Case Studies.)
Censoring for inconvenient practices is much less fairly applied. Banning of large-business sites for convenience, competitive, or editorial reasons is rarely, if ever, done. If
Google censored Amazon for convenience, competitive, or editorial reasons there would
be hell to pay. Suits would be filed, Congressional investigations would be held, PR
campaigns would be executed. A small-business owner doesn’t have these advantages.
If a small-business site has been censored for an inconvenient practice, it may be very
difficult to determine which aspect of the site is causing the problem. Search engines are
understandably very reluctant to disclose, especially in writing, that they have suppressed
public access to an entire site for their own convenience. They are even more reluctant to
disclose that a site has been suppressed for criteria that are conspicuously not being
applied to other sites.
Here are some potentially inconvenient practices and features:
-Large number of pages – Sites with a large number of pages may be a problem for some
search engines. If the engine indexes the entire site, a large amount of search engine
resources (disk space, bandwidth) could be consumed by a site that might not be very
important. Normally, we would expect the depth-of-crawl algorithm to handle this by
indexing a relatively smaller number of pages in sites receiving relatively less traffic or
otherwise having less merit. There is increasing evidence that the major search engines
do indeed ban small-business sites merely for having a large number of pages. It is also
true that all medium and large sites have (by definition) a “large number of pages.”
-Links. Sites that have a large number of outgoing links such as directories or sites with a
large “links” page may tend to upset the link popularity scheme for some search engines.
Sites that have forums, message boards, guestbooks, blogs, or other features that allow
users to publish a link may also be seen as interfering with the link popularity concept.
Google’s PageRank site ranking algorithm is less susceptible to these problems because it
automatically penalizes pages for outgoing links while rewarding them for incoming
links. Some site owners claim that their sites have been censored merely for having a
links page. Buying or selling of links, or requiring links in return for a service could
improve a site’s link popularity.
Google says in a form email sent to web sites that inquire why they have been banned:
"Certain actions such as buying or selling links to increase a site's PageRank value or cloaking - writing text in such a way that it can be seen by search engines but not by users - can result in penalization."
The largest single buyer of links is probably Amazon, which has a very successful affiliate program in which web sites get a commission for duplicating Amazon-provided data and linking to Amazon pages. Needless to say, Google has not "penalized" Amazon. Google reports (3/06) indexing 144 million pages at Amazon.com! Yahoo sells links from their directory. Google reports indexing 233 million pages at Yahoo.com. AOL/Netscape's Open Directory Project (ODP) specifically requires users of ODP data to install massive link farms pointing to ODP, the sort of practice that Google's annual report mentioned. Google has not instituted punitive procedures against AOL. Google runs a copy of the Open Directory on its own site (directory.google.com). Google reports (3/06) that they index 12.4 million pages in their own directory, each page of which links to AOL's ODP.
Censoring for Competition Suppression and
Editorial Reasons
Directories or other collections of links compete directly with search engines. Our case
studies strongly suggest that search engines censor small-business directory sites in order
to suppress competition. Search engines also engage in many other business activities
such as selling of things, provision of email, message board, video, photo, and mapping
services, etc. As long as it is legal to do so, it is unreasonable to expect that they would
not suppress larger competing small businesses in search results. Suppression of larger small-business sites that compete with a search engine, or with a business partner of a search engine, is also likely. Censoring a single small-business competitor
would certainly have no effect on the bottom line of a major search engine. Censoring
thousands of such sites would obviously have a beneficial effect.
There is currently no convincing evidence that any of the major search engines bans
small sites (less than 100 pages) for editorial or competitive reasons. There are plenty of
small “Google sucks” sites out there that have not been banned by Google.
The National Security Argument
There is a National Security Argument that goes something like this: "We are punishing you
but we can’t say why we are punishing you or give you an opportunity to defend yourself
because doing so might disclose information that could be used by the enemy.” Search
engine people use a version of this argument to justify their refusal to disclose why a
particular site has been censored. The idea is that a spammer might have found and
exploited a previously undisclosed weakness in a search engine. If the search engine
discloses the reason the site has been banned, it might add some confirmation that the
deceptive technique works. The spammer might spread the word or be more likely to use
the technique on another site.
This argument may have had some validity ten years ago but is currently ridiculous.
Search engines have been around for a long time (by Internet standards). Search
technology is well developed. Abuse methods are now well known; you just read about
most of them. A spammer that has implausibly found a new weakness has other ways to
measure the effectiveness of its method.
A much more plausible explanation is that search engines need to conceal the fact that the
site is being banned for a practice that others are being allowed to continue (buying links,
duplication of data, directories, links pages, guestbooks, message boards, etc.) or that the
site is being censored for competitive or arbitrary editorial reasons. Search engines are
using the “national security argument” to conceal their own discriminatory practices.
Google has announced a plan to notify some webmasters of censored sites (presumably
the ones that have been banned for clearly deceptive practices) that their site has been
censored and the reason for the action. The notification will be automatic and not at the
request of the webmaster. Our understanding is that Google will generally continue to
refuse to disclose the reason for censoring to webmasters that do request such
notification. This allows Google to disclose that a site has been censored for a deceptive
practice while continuing to deny that it is censoring other sites for reasons other than
deceptive practices.
The Google Sandbox
Many webmasters report that Google has a "sandbox" in which websites are confined "for being bad" even after they have ceased the "bad" behavior, requested reinstatement, and been reincluded in Google's index. The site is no longer completely censored
and can be found in the index, but has an abnormally low rank, much lower than it had
before being banned. New websites are also said to be sent to the sandbox for some
period of time. Google people have generally denied the existence of a sandbox although
some agree that there is a “sandbox effect.” (Notice an interesting continuation of
the parent-child psychology here. Webmasters are often willing to see themselves as
children being “punished” for “being bad” by being sent to the “sandbox” for a
“timeout.” They are very willing to assume that if they are “banned” or “in the sandbox”,
they are “doing something wrong.”)
The sandbox effect could be partly explained by the site popularity factor. A site banned
by Google would lose traffic and therefore lose site popularity. When the site was
restored to the index, it would still have reduced traffic, and site popularity and therefore
poorer rank initially. Gradually, traffic and therefore site popularity and rank would
improve. Voila, the sandbox effect. This effect would be much more noticeable with
Google than with the other search engines because Google generally contributes more to
a site’s traffic. Therefore a site banned by Google would lose more traffic and site
popularity. If this were the case, we would expect to see at least a somewhat proportional reaction from the other major search engines.
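The feedback loop just described (ban, lost referrals, lower site popularity, depressed rank after reinstatement, slow recovery) can be illustrated with a toy simulation. Every constant below is an assumption chosen only to show the shape of the effect, not a measurement of any search engine's behavior.

    # Toy simulation of the popularity-feedback explanation of the "sandbox effect".
    popularity = 1.0      # normalized site-popularity signal before the ban
    for week in range(25):
        banned = 5 <= week < 10                      # site is banned in weeks 5-9
        referrals = 0.0 if banned else popularity    # search referrals track popularity
        traffic = min(1.0, referrals + 0.2)          # 0.2 = traffic from non-search sources
        # Popularity drifts toward current traffic (simple exponential smoothing).
        popularity = 0.8 * popularity + 0.2 * traffic
        rank_score = 0.0 if banned else popularity   # rank proxy; higher is better
        print(f"week {week:2d}  rank score {rank_score:.2f}")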
However, using the age of a site (domain name) as a factor in ranking is an obvious anti-spam technique. Since spam sites are continuously being discovered and banned by
search engines, spammers have a continuing need for new unsullied domain names.
Therefore, a site that has been operating for a long time is much less likely to be a spam
site and it makes sense for search engines to rank newer sites lower. Spammers counter
by buying previously-owned domain names released by failed non-spam web sites in
order to get their “history” and residual incoming links. (Legitimate site owners can be
inadvertently banned because of purchasing or inadvertently duplicating a domain name
previously used by a spammer.) Search engines have apparently countered by monitoring
domain name use. If a name used to host a web site disappears (web site down) and then
reappears it might now be owned by a spammer.
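A crude sketch of how the age-of-site and domain-reuse signals described above might be folded into a spam-risk score follows. The thresholds, weights, and dates are invented for illustration; search engines publish nothing about how, or whether, they compute such a score.

    # Illustrative only: combine age-of-site and domain-reuse signals into a risk score.
    import datetime

    def spam_risk(first_seen, downtime_gaps_days, today=datetime.date(2007, 4, 1)):
        """Newer domains, and domains that disappeared and reappeared, score riskier."""
        age_days = (today - first_seen).days
        age_factor = max(0.0, 1.0 - age_days / 1460.0)   # assume ~4 years to full trust
        reuse_factor = 0.5 if any(gap > 60 for gap in downtime_gaps_days) else 0.0
        return min(1.0, age_factor + reuse_factor)

    print(spam_risk(datetime.date(2006, 11, 1), []))     # young site: higher risk
    print(spam_risk(datetime.date(2001, 3, 1), [120]))   # old site with a long outage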
Some webmasters report that their Google referred traffic suddenly dropped drastically
and never recovered, and that there was not a great correlation with other search engine
traffic. They further report a similar drop in PageRank reported by Google. This suggests
that Google is “manually” adjusting PageRank (site-unique bias) for some sites.
Other Censoring Issues
The people in a censoring department, be it in China or at a major search engine, are
probably relatively poorly paid. They sit at a monitor all day adding sites to the censored
sites list. If they exceed their quota, they might get a bonus. These people may spend time
randomly surfing around looking for sites that meet the criteria specified for censoring
but they are more likely to be reviewing sites that have been nominated for
censoring. They are not paid to make subtle and complex value judgments.
All the major search engines have a system where anybody can nominate a site for
censoring by filling out an online “spam report” form. People can nominate the sites of
their competitors or any site they don’t like for any reason. The larger the traffic a site has
and the more people that see the site, the more likely it is that someone will nominate it.
The search engines can also have mechanical means to nominate sites. For example,
tracking data[8] and site popularity data can be analyzed in order to produce nominations.
Censoring for competitive or convenience reasons is important to a search engine only if
the target site has significant traffic or has a large number of pages indexed. If the site is "naturally" getting very few referrals (clicks) from the search engine, then the beneficial effect of banning that site would be minor. Many small-business site owners report getting banned only after they achieved considerable size and popularity.

[8] Tracking Data: Data showing the web usage (pages visited, time spent on each page, time-of-day visits were performed, subject matter of pages visited, etc.) of individual web users. Usage patterns of legitimate users differ from those of click-fraud participants and can also be used to identify spam sites. Tracking data can be obtained any time a user visits a web site owned by a search engine or one of its advertising partners.
Because censoring can be performed from anywhere, the censoring department is an
obvious choice for outsourcing to an offshore location that has lower labor costs.
One problem is that censoring could obviously be used for all sorts of nefarious
purposes. A censor could decide to rent himself out to the highest bidder. If advertising is
going for $2 per click, imagine what it would be worth to censor a competitor’s site, even
temporarily, just before Christmas. Maybe one of the censors just doesn’t like
Democrats. The possibilities are endless. Even if search engine management has not, at a
corporate level, used its censoring power for such purposes, what steps have they taken to
prevent abuse by individual censors or groups of censors? If censoring is done offshore,
there are additional concerns.
All of the precautions cost money. Censored sites could be reviewed at a second level but
it would cost more. Complaints from sites that happen to notice that they have been
censored could be reviewed by people other than the group that did the original
censoring, but it would cost more. Since search engines consider all of these practices to
be trade secrets, we have no way of knowing what precautions they employ. Since search
engines consider that censoring is an exercise of their free-speech rights and that they
have no obligation to webmasters, it appears unlikely that they would pay for extra
precautions to protect web sites.
Google has recently changed their policy to re-review and consider reinstating only those
sites that stipulate in writing, in advance, that they are guilty of a deceptive practice and
have discontinued that practice. Google will no longer consider reinstating a small-business site that has been banned for competitive, editorial, or convenience reasons, a
site banned for reasons not specified in their guidelines, a site banned by mistake, or a
site that does not know why it has been banned. One can easily infer that this change was
implemented because of massive complaint volume from sites banned for non-deceptive
reasons.
Search Engine Bias and Site-Unique Bias
Recall that in order to incorporate site factors such as site popularity and age-of-site into
their ranking algorithms, search engines necessarily must maintain a database of sites as
well as a database of pages. We have mainly been discussing outright censoring
(banning) in which a web site is completely excluded from access by a search engine’s
users. Outright censoring is easily detected by a site owner. All the major search engines
admit to banning on a site-by-site basis. Major search engines will, upon request,
generally confirm or deny that a particular site has been censored while often concealing
the reason for such action.
However, all the majors claim or at least strongly imply that their ranking and depth-of-crawl algorithms are applied fairly and equally to everybody. They say that if your site is not
censored, any sudden change in rank is due to a general change that the search engine
made in its algorithm. The same algorithm is applied to all sites handled by that search
engine.
The preceding paragraph is literally true. The algorithms are indeed applied to all sites.
However, this does not mean that there cannot be gross bias against particular individual
sites incorporated into an algorithm. For example, a ranking algorithm could easily
contain a “rule” that says: “Check if this site is on our ‘bad sites list’, if so, subtract 264
from its merit ranking value; if it is on our ‘good sites list’, add 29 to merit.” This sort of
site-unique bias would be more subtle than outright censoring and more difficult for a site
owner to prove, even if it was so severe that it was effectively impossible to find the site
in a search.
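The hypothetical "rule" just quoted is trivial to express in code. The sketch below is purely illustrative: the list contents, the numbers 264 and 29, and the function names either come from the example above or are invented here, and no engine discloses whether or how it does anything of the sort.

    # Sketch of site-unique bias layered on top of a general ranking algorithm.
    BAD_SITES = {"example-banned.test"}       # hand-maintained "bad sites list"
    GOOD_SITES = {"example-favored.test"}     # hand-maintained "good sites list"

    def merit(site, base_score):
        """Apply the general algorithm's score, then any site-unique adjustment."""
        score = base_score                    # uniformly computed for every site
        if site in BAD_SITES:
            score -= 264                      # effectively buries the site in results
        if site in GOOD_SITES:
            score += 29
        return score

    print(merit("example-banned.test", 180))   # -84: unfindable in practice
    print(merit("some-other-site.test", 180))  # 180: unaffected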
Site-unique bias is easy to incorporate into the site database that a search engine must
already maintain. Much more complex site-unique bias schemes are obviously possible.
We can define site-unique bias as a case where a search engine ranking algorithm or
depth-of-crawl algorithm contains or refers to site-unique information such as domain
names, trademarks, words, or phrases unique to a particular site, or other information
identifying particular sites in order to apply bias to the ranking or crawling of individual,
hand-selected sites. Many site owners claim that their drop in ranking at a single search
engine is so catastrophic that it must be the result of site-unique bias. (See Kinderstart
Case for a convincing instance of site-unique bias.) Because site-unique bias is more
difficult to prove than outright censoring, it represents an opportunity for search engines
to avoid some of the hassles surrounding censoring such as the need for “re-review” of
complaining censored sites. (A bias scheme that is uniformly applied to all sites (site
popularity, age-of-site) is not considered to be site-unique bias.)
Search engine bias can also be used against specific ideas as opposed to against specific
sites. Most people would consider it reasonable to rank pages and sites containing
pornographic words lower than pages not containing pornographic words, especially if
the search terms did not contain pornographic words or phrases. Most would also
consider it reasonable to rank pages containing phrases such as “site under construction”
below otherwise similar pages. Any search algorithm presumably expresses the results of
many such judgments. The same technology could also easily be used to rank pages
containing “Democratic candidate” below pages containing “Republican candidate”, an
action most Democrats would consider “unreasonable.” Because algorithms are
considered trade secrets, it is difficult (but not impossible) to determine if search engines
are employing “unreasonable” bias. As far as we can determine, political search bias is
not currently illegal.
Anti-Competitive Impact of Censoring on Small
Businesses
Search engine censoring departments are careful not to censor or apply major negative
site-unique bias to any site that appears to be owned by a relatively large business unless
the site employs practices clearly intended to deceive such as hidden text or doorway
pages. Large businesses have lawyers, publicists, and other ways of “pushing back” if
their site is censored for the convenience of the search engine or for competitive or
editorial reasons and the benefit to the search engine of censoring any one site is
generally relatively small. Therefore, essentially all censoring for these purposes is done
against small businesses. For example, our studies found no case in which a large
business using Open Directory data had been censored by any major search engine while
small businesses using Open Directory data or otherwise containing directories are very
frequently censored, especially by Google. (See Search Engine Censoring of Sites Using
Open Directory Data, The SeekOn Case, and The Kinderstart Case.) Amazon and other
large companies are allowed to “buy links” or use “non-original” content where small
companies are frequently censored for doing so. Search engine censoring therefore works
to suppress small businesses in favor of large businesses. This is especially unfortunate
because otherwise the web represents a major opportunity for smaller businesses.
Issues with Search Engine Editorial Policies
Is there really anything legally wrong (or even morally wrong) with a search engine
having an editorial policy? Many people would say no. Maybe a search engine has just as
much right as a newspaper or magazine to determine what information they pass on to
their users. A search engine is investing in its index just as a newspaper or magazine is
paying for each square inch of page space. Maybe a search engine should have equally
complete control over what they put in their index. We don’t expect to see articles
favorable to Newsweek or other competitor in Time. We expect publications to have an
editorial “slant” or bias. There is not even any requirement or expectation that a
newspaper or magazine disclose its editorial policies.
It should also be obvious from the discussion of “merit” ranking, algorithm bias, site
databases, and the existence of censoring departments that search engines have the built-in capability to execute any desired editorial policy. If a search engine decided to be
“right-wing” or “left-leaning”, or to suppress competition, the necessary infrastructure
already exists. There is no technical reason and apparently currently no legal reason why
the merit algorithms couldn’t easily be adjusted to favor sites about Republicans over
those about Democrats or institute any other desired point of view.
However, the existence of the capability for search engine editorial policies and strong
evidence suggesting that such policies are being implemented is very disturbing for
several reasons. In many ways a search engine is not like a newspaper or other
publication.
Unaware Audience: Anybody who has been born and raised in a free country
knows all about editorial bias in media including newspapers, magazines, radio,
and TV. “Filtering truth” when obtaining information from these sources is
second nature. However, most people do not think of editorial bias, censoring, or
other filtering as applicable to search engines. Search engines are seen as
mechanical devices and therefore incapable of bias. A search for a pornographic
word or phrase returns millions of hits adding to the false impression that no
censorship or editorial policy is in place. Search engines do everything they can to
enhance this impression. In our opinion, this is deceptive and dishonest.
Unfulfilled Need: Most people use search engines precisely because they are
trying to get access to the largest and most diverse body of information possible
with the least amount of editorial filtering possible. If you don’t mind having your
information filtered through some editorial filter, there are many much better
sources of information. The growth of the Internet itself was largely fueled by the
public’s desire for unedited, uncensored information. Search services that are
more openly editorial (such as Ask.com) have been relatively unsuccessful.
Absence of Diversity: In the United States alone, there are quite a few TV
networks, radio networks, newspapers, and magazines. Worldwide there are many
more. But there are only three major search engines worldwide. A very small
group of people is setting the editorial policies for these search engines. This
small group is controlling a very important source of information for a very large
number of people. (See The Web Czars.) This has important implications for the
political process in the United States and elsewhere where elections could be
influenced by search engine bias. Even worse, if the most comprehensive search
is needed, Google is rapidly becoming a sole source.
Control Without Responsibility: Publishers, while having complete editorial
control over their publications, are also responsible for what they publish. If
someone libels a person in a newspaper article, that person can sue the
newspaper. If the publication contains illegal material such as child pornography,
or promotes illegal activity, it can be shut down. Rules for publishers have been
developed during a period spanning more than 500 years.
At the same time it is understood that an organization that merely acts as a
conduit or connection service for information (Telephone Company, Internet
Service Provider (ISP), Post Office) is not responsible for the content of the
information they convey. You can’t sue the Post Office because you were taken
by mail fraud. You can’t blame the phone company if you get a threatening call.
ISPs have (so far) been able to avoid any responsibility for information they
convey or store (e.g. child pornography) as long as they do not edit or filter the
information.
However, organizations that act as information conduits (connection services) are
not allowed to pick and choose the information they carry. The phone company is
not allowed to decide, using secret internal criteria, who can have a phone or what
they can say on the phone. Any restrictions regarding who can or cannot have or
use a phone (or other connection service) must be very well documented, very
public, and very fairly enforced. If it were not this way, the phone companies
would be running the country.
The major search engines currently have it both ways. They are able to editorially
and using undisclosed criteria “cherry-pick” the information their users are
allowed to see while denying any responsibility for the content of that same
information. Will they be able to continue to do this indefinitely or will a court or
legislative decision eventually force either more responsibility for content or more
fair handling of information providers? We will have to wait and see. The Internet
and search engines are new technology. Law and regulation have not caught up
yet.
Essential Nature of Search: Imagine what would happen to most businesses if
their phone service suddenly disappeared. What if the phone company could
arbitrarily refuse to reconnect them for no stated reason? Maybe the phone
company blackmails the business into buying “advertising” in order to be
reconnected. Now imagine that there were only three major phone companies and
one of them controlled more than 60 percent of all phone traffic. The main way
for more than half of the people to reach you is through this company. If Time
never publishes a favorable story about a business, they can certainly live with
that. For most businesses, being disconnected by the phone company would be a
terminal event. For many, being disconnected by Google is equally terminal. This
will be increasingly true in the future.
Service Nature of Search: Many people clearly and increasingly use search as a connection service. Many
searches are for company names or other company-unique information. If
someone searches for a unique trademarked company name and that company’s
site does not appear near the top of Google results the searcher can reasonably
conclude that the company has gone out of business or is so backward that it does
not have a web site. A search for the corner gas station produces a response.
Certainly a search for any company that does any significant business would also
produce a result.
Also, in providing a “clickable link” to destination web sites, search engines
clearly provide a connection service. It is far easier to merely click on a link to
connect to the web site than to correctly type www.searchenginehonesty.com as
would be needed if, for example, you read about the site in the newspaper or other
print publication and for some reason wanted to connect without using search.
Private Nature of Communication: Publications are public. However, search
engine results are private. John Q. Public presumably does not want the fact that
he is searching for “hot blond chicks” to become public and expects privacy in the
same manner that he would expect privacy in a telephone connection or other
connection service. People making phone calls expect not only that the content of
the conversation is private, but also that the time and date of the call and person or
business called is private, unless obtained under court order. People using search
services to connect to web sites have the same expectations.
Individual Nature of Communication: Publications are designed for a mass
audience. However, like information conveyed in telephone conversations, search
engine results are specifically designed for the single individual that conducted
the particular search. Search results are a service, not a publication.
Source of Information: A telephone company may provide the wires, software,
and other infrastructure for processing and handling your voice but you are
providing the information content when you talk on the phone. The phone
company does not own your communicated information and is not allowed to use
it for their own purposes even though it has access and technically could easily
eavesdrop or record your conversation. Similarly, search engines don't actually
provide original information in results data but only mechanically process and
handle information provided jointly by web sites and searchers. Web sites and
searchers provide this information for the express purpose of obtaining
connection. The “publisher” is actually the web site. The search engine is a
connection service. If anybody has “free-speech” rights it should be the web site.
So the 64 billion dollar question is this: Is a search engine more like a newspaper or
more like a telephone company? Is a search engine providing a publication or a
connection service? In minor court cases fought by large-cap search engines against tiny
small businesses (e.g. Kinderstart v. Google), search engines have so far been able to
maintain the idea that search results are publications and that therefore they have free-speech editorial rights over search results. We can expect to see this question argued very
extensively in the courts as well as in the court of public opinion in the next few years.
For an illustration of search engine censoring issues see The Googlecomm Fable.
Google Defamation of Small-Business Sites
PageRank (PR) is Google’s merit factor used (along with search term relevance) in
determining the ranking of pages in Google search results. While other search engines internally develop similar merit factors for the sites and pages they index, they do not publish those factors. Google publishes the PageRank (as a number between zero, minimum page "value", and ten, maximum, with a green bar of corresponding length) of sites listed in
their Google Web Directory, which is a clone of the Netscape Open Directory. (The only
significant difference between Google’s directory and the AOL/Netscape Open Directory
is the addition of PageRank.)
Users can also download a free Google toolbar that displays PageRank of any page being
displayed in the user's browser. Users can use the PageRank to assess the “quality” and
“importance” of the site and page they are viewing. Generally, even very low-traffic, very
minor sites have a PageRank of at least 2. Google’s home page has a PageRank of 10.
Other very popular sites (Yahoo, MSN, Excite, AOL) have a PageRank of 9.
Internally, Google certainly uses a PageRank numerical value that has more than 11
gradations for ranking search results. The displayed 0-10 PageRank is thought to be a
logarithmic representation of the internal PageRank, which is said to be named after
Google cofounder Larry Page as opposed to being named for its page ranking function.
As described above, PageRank is applied to sites as well as pages.
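The sketch below implements the standard published PageRank formulation on a four-page toy link graph and then shows how an assumed logarithmic mapping, as discussed above, compresses a wide range of raw scores into the 0-10 toolbar scale. The link graph, damping factor, and log base are illustrative assumptions; Google's internal values are not public.

    # Illustrative sketch: textbook PageRank plus an assumed log mapping to 0-10.
    import math

    LINKS = {                  # page -> pages it links to (the "votes" it casts)
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        """Repeatedly redistribute each page's rank among the pages it links to."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
            rank = new_rank
        return rank

    def toolbar_value(raw_score, base=8.0):
        """Assumed log scale: each displayed step is roughly `base` times harder."""
        return 0 if raw_score < 1 else min(10, int(round(math.log(raw_score, base))))

    print(pagerank(LINKS))                     # fine-grained internal-style values
    for raw in (1, 8, 64, 512, 4096, 8**10):   # how a log scale compresses them
        print(raw, "->", toolbar_value(raw))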
Google tells their users (Google Technology
http://www.google.com/technology/index.html 7/2006):
“PageRank relies on the uniquely democratic nature of the web by using its vast link structure as
an indicator of an individual page's value. In essence, Google interprets a link from page A to
page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of
votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages
that are themselves 'important' weigh more heavily and help to make other pages 'important.'
Important, high-quality sites receive a higher PageRank, which Google remembers each time it
conducts a search. Of course, important pages mean nothing to you if they don't match your
query. So, Google combines PageRank with sophisticated text-matching techniques to find pages
that are both important and relevant to your search. Google goes far beyond the number of times
a term appears on a page and examines all aspects of the page's content (and the content of the
pages linking to it) to determine if it's a good match for your query."
"A Google search is an easy, honest and objective way to find high-quality websites with
information relevant to your search."
"PageRank is the importance Google assigns to a page based on an automatic calculation of
factors such as the link structure of the web."
(Google Toolbar Help http://toolbar.google.com/button_help.html 7/2006):
"Importance ranking. The Google Web Directory starts with a collection of websites selected by
Open Directory volunteer editors. Google then applies its patented PageRank technology to rank
the sites based on their importance. Horizontal bars, which are displayed next to each web page,
indicate the importance of the page, as determined by PageRank. This distinctive approach to
ranking web sites enables the highest quality pages to appear first as top results for any Google
directory category". (Google Web Directory Help http://www.google.com/dirhelp.html 7/2006)
[emphasis added]
Google's description for their users unequivocally and emphatically states that PageRank is "honest", "objective", and the result of an "automatic calculation" that is completely dependent on external factors such as the "democratic nature of the web" and
whether the site is “highly-regarded by others” as indicated by links to the site from other
sites. However, as we have seen, there is substantial evidence that Google is using its
own, internally determined, subjective, and manually applied site-unique bias to suppress
PageRank for individual hand-picked sites.
Also, for pages on sites that are censored (banned) by Google, Google’s directory listing
and toolbar indicate a PageRank of zero (minimum). Google’s toolbar indicates “not
ranked by Google” as opposed to zero on those pages that have not been evaluated by
Google's system. A zero therefore means to a Google user that the page has been
evaluated by Google’s automated, honest, and objective technology and found to be
meritless and of minimum “importance” and “quality.” In actuality, a zero in the case of a
banned site means that the site has been manually banned by the Google censoring
department based on undisclosed subjective criteria and has little or nothing to do with
external factors such as how the site is regarded by others or any other objective criteria.
It therefore appears that a zero or otherwise artificially depressed PR for pages on a
manually censored or biased site, in combination with Google's description of the
automated, honest, and objective nature of PageRank, represents a knowingly false
derogatory statement (defamation) by Google regarding the suppressed web site. Google
is “adding insult to injury.”
Kinderstart sued Google, in part, based on the idea that Google is engaging in defamation
and libel of web sites that have been banned (e.g. SeekOn) or been subject to “blockage”,
arbitrary assignment of PR=0, or other arbitrarily imposed reduction in PageRank (e.g.
KinderStart).
See the Open Directory Case Study for data showing that Google sets PageRank to zero
for sites that have been banned by Google. Google admits to the practice of manually
banning sites based on undisclosed criteria.
Since Google apparently only censors or suppresses small businesses, the defamation
issue currently only impacts small businesses.
It seems that Google is phasing out their Open Directory clone. It has not been updated
since 2005.
Analysis
The major search engines have been very successful in convincing most search users that
search results are “fair”, “honest”, and “objective”, while simultaneously convincing
courts that search results are merely “opinions” of an editorial entity. If practices such as
those described here were to become common knowledge among search users, or worse
yet, the majority of users began to see search engines as editorial entities, all the major
search engines might be adversely affected.
People select editorial entities (newspapers, TV news, etc.) based at least somewhat on
the editorial slant of the publication. Conservatives are more likely to read the Wall Street
Journal, watch Fox News, and so forth. If people began to choose search engines as they
choose newspapers or TV news sources, then organizations such as Fox News and the
New York Times would be encouraged to set up their own search engines based on their
own already widely respected editorial policies. This would likely adversely affect the
existing search triad.
Suppose Google entered into a relationship with Barnes and Noble and then censored all
of Amazon’s pages (and their partner site pages) from the Google index, apparently a
completely legal[9] and legitimate business maneuver. The resultant publicity would act to
destroy the public illusion that Google was not an editorial entity and lead to the scenario
described above. This promotes a situation in which censoring and major site-unique bias
actions by search engines are only taken against small businesses.
Similarly, in order to conceal the double standard regarding their treatment of small-business sites vs. large-business sites, and the general existence of editorial policies,
search engines are unable to publish detailed guidelines for webmasters. This produces
widespread confusion among small-business webmasters regarding the rules, most
specifically concerning what constitutes “black hat” or “white hat” optimization. Search
engines therefore represent a continuing and critical uncertainty for small businesses.
Search engines (especially Google), aided by the dearth of published rules, have also
been very successful in convincing small-business web site owners that catastrophic rank
problems or complete disappearance of their site from search results is the consequence
of some site configuration error or inadvertent violation by the webmaster of some arcane
and unpublished rule. Site owners have an enormous incentive to believe that their
problem has such an easily fixable cause as opposed to the idea that their site was banned
because it was seen as competition or otherwise has a design feature only allowed in
large-business sites, an unfixable situation. It can literally take years to determine the
truth through a trial-and-error process.
The massive and continuing onslaught of spam limits the resources search engines can
reasonably be expected to expend on a per-site basis to determine whether a small-business site is or is not spam and leads to a shotgun approach. See Does Google Ban
Large Sites by Small Businesses? for more discussion.
To an unprecedented extent, a search engine’s infrastructure is hidden from users. They
all have similar simple entry pages into which one enters search terms. The result pages
look similar. A search engine could be constructed for .01 percent of Google’s capital
investment and the only significant difference between this engine and Google would be
in the quality of results, an almost completely subjective parameter. Results of the cheap
alternative might well be adequate for entertainment purposes. Google at about 60
percent of searches is not a monopoly. Google can point to hundreds of search engines
that superficially resemble Google. However, Google has many advantages including
massive infrastructure investment that allow it to deliver qualitatively better results than
any competitor. People who need high-quality search for research, business, education, or
other more serious applications are therefore in an increasingly sole-source situation. See
Is Google an Unbreakable Monopoly? for more discussion.
[9] The author is not a lawyer. Communications law is a very complex area.
The major search engines would likely be the greatest corporate victims of any loss of net
neutrality[10]. The first move any ISP would make would be to put up their own search
engine and suppress access to the major search engines. The search engines and their
supporters are now in the position of having to claim that ISP censoring or suppression is
bad but, somehow, search engine censoring and suppression is OK! Functionally, the
effect on end-users or businesses is very similar.
Conclusions
Search engines are legally seen as editorial entities but the reality is that they perform
connection services that are increasingly essential to successful operation of any public
web site. Web sites in turn are increasingly essential to the operation of most businesses.
Search engine editorial policies heavily discriminate against small businesses.
In view of their editorial-entity status, search engines can legally and properly institute
editorial policies that incorporate political bias or any other bias that would be acceptable
in other publishing such as newspapers and TV networks. Because of the very small
number of significant search engines, because of Google’s general dominance, and
because of Google’s functional monopoly in the “high-quality” search area, this
represents a dangerous loss of editorial diversity. If search engines are to continue to be
considered editorial entities, the public should be educated to that effect and the appalling
lack of editorial diversity must be addressed. If search engines are eventually determined
to be connection services then they should be constrained by regulations similar to those
that apply to other connection services.
Unless this situation is corrected by legislative or judicial action, small business and
freedom of information will be very adversely affected.
Any solution to the net neutrality problem should address these search engine issues.
Additional Information
Azinet operates a web site for discussion of search engine issues at
http://www.searchenginehonesty.com/ . See this site for additional information including
the following reports:
The Kinderstart Case
Search Engine Censoring of Sites Using Open Directory Data
Is Google an Unbreakable Monopoly?
Search Engine Mechanics
Search Engine Webmaster Guidelines
[10] Net Neutrality Issue: Should cable companies, telephone companies, and other Internet Service Providers be allowed to suppress or degrade their customers' Internet communications with specific hand-picked destinations relative to other destinations?
Copyright © 2007 Azinet LLC AZI:2007:103