**Abstract:** Most search
engines exploit spiders to implement the information gathering
process. We present a technique to reduce the memory resources
required by such spiders, to improve coverage, and to provide
for scaling through parallelization. The technique exploits the
direct enumeration of Internet hosts through the Domain Name System (DNS).
Tools based on this technique have been successfully applied in
a search engine for the ".it" domain but can be applied
directly to other domains. We report statistics on the results
of the use of our tools in this context.

Web robots are tools used to gather data from the
Web [DeBra 1996]. Robot behaviour may be formally described by
means of a directed graph *G* = (*N, A*), where *N*
is a finite set of nodes, corresponding to Web documents uniquely
specified through their URL, and *A* is the set of arcs which
represent unidirectional pointers, corresponding to the links
between documents. *G* does not have to be connected,
since there can be URLs not reachable from other URLs: on the Internet
there can be completely autonomous Web servers defining subgraphs
without any incoming arcs. A more appropriate model for the highly
irregular World Wide Web is hence a set of connected components.

The task of a robot is to visit each node in *G*
exactly once, avoiding traversing the same path twice. This is done by
building a sequence of graphs, each one
representing a more precise approximation of the whole graph *G*.

The two main visit techniques are Depth-First Search
(DFS) and Breadth-First Search (BFS). With the first method,
when a node is visited, we move away from it as far as we can until
a blind alley is found (a node *v* all of whose adjacent
nodes have already been visited). The order of visit is LIFO and
the algorithm may, as a consequence, activate a recursive process
whose depth is related to the length of the longest acyclic
path present on the Web. The second method, instead, visits the
nodes according to increasing distance *d* from the start
node *r*, where the distance between *r* and a generic
node *v* is the length of the shortest path from *r*
to *v*. The order of visit is, in this case, FIFO.

Often the two techniques are combined: a set of starting
points for the search is built by performing a BFS with maximum
distance from a root, and then DFS is applied from this set. All
the other methods are variants of combinations of BFS and DFS.
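The two visit orders can be sketched with a single frontier structure (Python is used here purely for illustration; the paper prescribes no implementation language, and the toy `graph` dictionary stands in for fetching a page and extracting its links):

```python
from collections import deque

def crawl_order(graph, root, method="bfs"):
    """Visit every node reachable from root exactly once.

    graph: dict mapping a URL to the list of URLs it links to.
    method: "bfs" consumes the frontier as a FIFO queue,
            "dfs" consumes it as a LIFO stack.
    """
    frontier = deque([root])
    visited = {root}               # the "paths memory" structure
    order = []
    while frontier:
        node = frontier.popleft() if method == "bfs" else frontier.pop()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:   # never traverse an arc twice
                visited.add(neighbour)
                frontier.append(neighbour)
    return order
```

On the toy graph `{"r": ["a", "b"], "a": ["c"], "b": []}` rooted at `r`, BFS visits nodes in order of growing distance, `["r", "a", "b", "c"]`, while DFS runs down to a blind alley before backtracking.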

Note that the whole graph *G* is not known a
priori. New nodes are added each time an already acquired arc
is followed, building a new graph *G _{i}* from the
previous approximation *G _{i-1}*.

An analysis of the above algorithms reveals some limits:

- *Paths memory*: to guarantee the correctness of the visit algorithm, an auxiliary structure (generally a database) must hold every node already visited (every URL already reached). Maintaining this structure is particularly time consuming, since the number of documents available on the Web is estimated at approximately 40 million pages.
- *Low scalability*: the Web indexing process is well suited to parallel execution by means of a fixed number *R* of robots, each following a different part of the graph *G* used as a model. De Bra [DeBra 1996] calculates that, given a starting base of twenty million Web documents (containing about a thousand gigabytes of text) and an average transfer rate of 3 Kbyte/s, fetching all the information with a single robot would take about 8 months. The same task, carried out by ten distinct robots, may be accomplished in about 20 days. Adopting a parallel indexing architecture, however, introduces a new class of problems. First of all, since the structure is not known a priori, the structure in which the URLs are stored must be distributed among the robots. Every visited document is recorded with a disk write; every time a document is fetched, a check must be performed, with a disk read, to be sure it has not already been handled by another robot. The more robots are present, the higher the probability of access conflicts.
- *Balance absence*: in order to achieve good performance, an architecture based on a high level of robot parallelism must use a policy for balancing the assignment of work.
- *Root identification*: since *G* is not fully connected, a full search requires selecting at least one root node within each connected component.
Connectivity within each component is low, as shown by a statistical survey performed within the RBSE project [Eichmann 1994]: 59% of Web documents have only one link and 96% have fewer than five links. The choice of the root within each component is therefore quite critical.

We present some tools we developed for enumerating hosts and facilitating Web search. These tools are used in ARIANNA[], an Internet search engine for Web sites with Italian language content. Since some hosts with Italian language content do not appear under the ".it" DNS domain, several heuristics, whose discussion is beyond the scope of this paper, have been employed to find them.

The method we implemented exploits the fact that many problems affecting Web visit algorithms may be solved by building a priori a list of URLs that is as wide as possible. Each item is used as a starting point for a distinct robot. Similar techniques have already been proposed; a list of them is presented in [DeBra 1996]. However, most of these are ad hoc and not general enough: for example, one suggestion is to survey Usenet News postings in order to discover URLs. An important aspect of the method we propose is that it is parametric: different selection criteria give rise to a spectrum of solutions. A very selective criterion, even though it reduces indexing time, could discard valid hosts; a weakly selective criterion, on the other hand, could show the opposite behaviour.

The WWW structure is inherently chaotic. The final
aim of the BFS and DFS algorithms is to build a spanning tree for
each connected component of the Web. A spanning tree suggests
the idea of finding a way of hierarchically ordering net hosts. Luckily,
this ordering can be derived from the domain subdivision provided
by the DNS system [Albitz & Liu 1996].
Using tools like *host* or *nslookup* and directly querying
authoritative nameservers, one can find the set of host names *H*(*D _{i}*)
registered under each domain *D _{i}*; the union of these sets over all domains gives the host list *H*.

Let *n* = #(*H*); then *R* robots (with
*R* << *n*) can be used to examine the *n* found
hosts in parallel using DFS or BFS techniques (or their variants).
Since not all Internet hosts have a WWW server, assigning to a
robot a machine that contains nothing to be indexed is a waste
of time and could invalidate any balancing policy. The compromise
we adopted is to preselect all Internet hosts having a canonical
name (an A record) or an alias (a CNAME record) among those most commonly
used for machines running WWW services. In detail, we define a
criterion *K* according to which a host becomes candidate
to be indexed: for instance we can test whether its name contains
"www", "web", "w3" as substrings.
Let *W* be the set of all hosts in *H* whose names satisfy *K*.

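A minimal sketch of such a criterion (the substrings below are the ones named in the text; the function and variable names are illustrative):

```python
# Substrings commonly found in the names of machines running Web
# services, as suggested in the text.
WWW_SUBSTRINGS = ("www", "web", "w3")

def K(hostname):
    """Criterion K: True if the host is a candidate Web server."""
    name = hostname.lower()
    return any(s in name for s in WWW_SUBSTRINGS)

def build_W(H):
    """W: the subset of the DNS host list H whose names satisfy K."""
    return {h for h in H if K(h)}
```

Swapping in a different substring list changes how selective *K* is, and hence the cardinality of *W*.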
Let *Q* = (*P*_{1}, … , *P _{R}*)
be a partition of *W*, with subset *P _{r}* assigned to robot *r*.

Our method has the following properties:

- the path memory of robot *r* is limited by the number of Web documents contained in the Web servers indexed by *r*;
- good scalability with respect to the number of robots used: since *P*_{1}, … , *P _{R}* are disjoint, there is no need to keep shared structures to record the paths already followed by each robot;
- the choice of different substrings (a new definition of the criterion *K*) allows us to modify the cardinality of *W*.
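One way to obtain the disjoint subsets is a simple round-robin split (a sketch only; the assignment policy we actually adopted, based on Autonomous Systems, is discussed later):

```python
def partition_hosts(W, R):
    """Split the candidate set W into R pairwise disjoint subsets.

    Because the subsets are disjoint, each robot can keep its own
    private path memory: no shared structure is needed.
    """
    parts = [set() for _ in range(R)]
    for i, host in enumerate(sorted(W)):   # deterministic order
        parts[i % R].add(host)
    return parts
```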

A further optimization avoids a direct query
to an authoritative nameserver for each domain when building
the set *H*. To this end, the monthly RIPE[] survey
(*hostcount*) is exploited. This has the advantage of saving
bandwidth: one avoids building the Italian domain topology, since
it is already available on the Internet. [Tab. 1] reports the results
obtained by applying criterion *K* to the *hostcount* for
the Italian domain. However, this improvement introduces two
problems. Let *H _{RIPE}* be the host used by RIPE
to survey the net and let *NS*(*D*) be the set of authoritative nameservers for a domain *D*. Two situations may leave a domain uncovered:

- a query made from *H _{RIPE}* to all nameservers in *NS*(*D*) is not answered, for instance because of a time-out;

- the policy of a domain forbids zone retrieval by *H _{RIPE}*.

If one of these conditions occurs during analysis
of domain *D _{i}*, its hosts are recovered later through a direct DNS query to the nameservers of *D _{i}*.

Once the set of hosts in *W* has been built,
in order to determine which of them are actual Web servers, we
developed *testwww*, an agent which tests for the presence of
an HTTP server on a set of TCP ports chosen among those generally
used to provide such a service. *testwww* was tested on more
than 14000 Italian hosts [Tab. 3]. Let *T* be the subset of
*W* on which *testwww* actually found an HTTP server.

We then partition *T* into *R* subsets
(*P _{1}*, … , *P _{R}*), each assigned to a distinct robot.

In addition, *testwww* allows us to gather statistics [Tab. 4] on:

- the choice of ports used for the WWW service;
- the kind of Web server implementation;
- other services offered beyond WWW (such as Proxy-Server).

Some HTTP daemons provide several virtual Web servers
on the same IP address. This is known as virtual hosting: the
administration job is centralized while the content of each virtual
site is under the responsibility of its respective owner. A single
IP address is used, but each virtual site is assigned a CNAME
(or an A record on the same IP). Virtual hosting was standardized
in version 1.1 of the HTTP protocol [Berners-Lee,
Fielding & Frystyk 1996]. This requires
a revision of the robots' visiting algorithms: indeed, if the path memory
is managed only according to already visited IP addresses, documents
with the same document path but located on distinct virtual servers
will be missed.

We solved this problem using a heuristic: the *testwww*
agent, given the set of A and CNAME records *N*(*host*)
= { *A _{1}*, … , *A _{k}* } associated with the same IP address, treats each name as a potentially distinct virtual server, so that each virtual site is visited through its own name.
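The consequence for the visit algorithm can be sketched as a path memory keyed by host name rather than by resolved IP address (the structure below is illustrative, not the paper's actual implementation):

```python
# Several virtual servers can share one IP address, so a path memory
# keyed by (ip, path) would wrongly merge distinct sites. Keying by
# (host name, path) keeps the virtual servers apart.
visited = set()

def should_fetch(host_name, path):
    """True the first time a (host name, path) pair is seen."""
    key = (host_name.lower(), path)
    if key in visited:
        return False
    visited.add(key)
    return True
```

Two virtual servers on the same IP thus keep separate entries even for identical document paths.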

To perform the partitioning of hosts into indexing
domains we decided to exploit the division into Autonomous Systems
(AS). The idea is to minimize the number of border gateways (according
to BGP4 terminology [Huitema 1995])
crossed during the indexing phase in order to reduce the probability
of using links with high traffic load.

For the case of the Italian domain we chose three
indexing hosts, each one in a different AS. The *prtraceroute*
tool [PRIDE 1996] tells
us, for each host, both which AS it belongs to and how many
ASes are crossed to reach it. We then filter the static information
given by the RIPE Registration Authority (via *whois*) in
order to assign hosts to the indexing points so that the number
of crossed border gateways is minimized and therefore the indexing
time is reduced.
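Given a table of AS hop counts per host (which in our setting would be distilled from *prtraceroute* output; the parsing step is omitted and the data layout is an assumption), the assignment itself reduces to a minimum search:

```python
def assign_to_indexing_points(host_as_hops, indexing_points):
    """Map each host to the indexing point reachable across the fewest
    AS border gateways.

    host_as_hops: dict host -> {indexing_point: AS hops crossed}.
    Missing measurements are treated as unreachable (infinite hops).
    """
    assignment = {p: [] for p in indexing_points}
    for host, hops in sorted(host_as_hops.items()):
        best = min(indexing_points, key=lambda p: hops.get(p, float("inf")))
        assignment[best].append(host)
    return assignment
```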

In summary our method has the following features:

- reduced indexing time since each robot has a preassigned task to be accomplished;
- reduced use of resources because there is no need to maintain shared structures;
- potential for parallelism in terms of robots that can be used;
- possibility of assigning work according to a policy of load balancing.

As may be expected, there are also some limits:
the most evident one is that we miss Web servers not satisfying
the *K* criterion. Several approaches are possible to mitigate
this drawback, although we expect it not to be significant. Statistical
data (reported in [Tab. 5]) show that we collect a remarkable
number of URLs compared to similar engines based on traditional
spider mechanisms: ARIANNA is currently the most complete Italian
search engine in terms of indexed information.

A first method to discover hosts not satisfying
the criterion *K* is based on a different way of integrating
DNS enumeration techniques with the visit algorithms. Once an indexing
domain *DI*(*r*) = { *H _{1}*, … , *H _{k}* } has been assigned to a robot *r*, any URL discovered during the visit that points to a host outside the enumerated set can be recorded, and the corresponding host indexed even though it does not satisfy *K*.

A second method is based on post-processing the already
collected Web pages once the indexing phase has ended. This is
done in order to discover new URLs not enumerated in the set
*W*. Note that this is a local job (not involving communication)
and is consequently less time-consuming.
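A sketch of this post-processing step (the extraction regex is illustrative and cruder than a full HTML parser):

```python
import re
from urllib.parse import urlparse

# Crude pattern for absolute http links in already fetched pages.
HREF_RE = re.compile(r'href\s*=\s*["\']?(http://[^"\'\s>]+)', re.IGNORECASE)

def new_hosts(pages, W):
    """Return hosts referenced by the collected pages but absent from
    the enumerated set W. Purely local work: no network traffic."""
    found = set()
    for html in pages:
        for url in HREF_RE.findall(html):
            host = urlparse(url).hostname
            if host and host not in W:
                found.add(host)
    return found
```

Hosts returned by this scan can then be probed (for instance with *testwww*) and fed to the robots in a later run.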

We report statistical data which refer to the use
of our tools within the ARIANNA search engine.

Criterion applied on hostcount | Month | Res. | Month | Res. |

6995 | 8608 | |||

6932 | 8573 |

hostcount Problems | Month | Res. | Month | Res. |

646 | 464 | |||

248 | 364 | |||

484 | 401 |

Direct DNS query |
Month | Res. | Month | Res. |

1231 | 738 | |||

513 | 134 |

testwww |
Month | Res. | Month | Res. |

8228 | 14431 | |||

7440 | 13780 | |||

6741 | 11045 | |||

424 |

ARIANNA | Month | Res. | Month | Res. |

6.629 | 8.134 | |||

1.351.442 | 2.095.318 | |||

10.151.481 | 7.558.919 | |||

6.932.135 | 9.879.552 | |||

17.187 | 127.173 | |||

2 hours | 2 hours | |||

15 days | 20 days |

The work described here was done within the cooperation between SerRA, Università di Pisa and OTM Interactive Telemedia Labs. We would like to thank A. Converti, L. Madella, A. Vaccari and G. Antichi of the OTM labs for their support.

[Albitz & Liu 1996] Albitz, P., & Liu, C. (1996).
DNS and BIND, 2nd Edition. O'Reilly & Associates.

[PRIDE 1996] PRIDE - Policy based Routing Implementation
Deployment in Europe, ftp://ftp.ripe.net/pride/

[DeBra 1996] De Bra, P.M.E. (1996). Introduction to
WWW Robot Technology, 5th International World Wide Web Conference.
O'Reilly & Associates.

[Eichmann 1994] Eichmann, D. (1994). The RBSE Spider-Balancing
Effective Search Against Web Load, First WWW Conference.

[Berners-Lee, Fielding & Frystyk 1996] Berners-Lee,
T., Fielding, R., & Frystyk, H. (1996). Hypertext Transfer Protocol
- HTTP/1.1 - RFC 2068.

[Huitema 1995] Huitema, C. (1995). Routing in the
Internet. Prentice Hall.
