Web Host Enumeration Through DNS

Domenico Dato, dato@di.unipi.it

Giuseppe Attardi, attardi@di.unipi.it

SerRA (Servizi di Rete Ateneo)

Dipartimento di Informatica, Università di Pisa

Pisa, Italy

Abstract: Most search engines exploit spiders to implement the information gathering process. We present a technique to reduce the memory resources required by such spiders, improve the coverage and to provide for scaling through parallelization. The technique exploits the direct enumeration of Internet hosts through Domain Name Server. Tool bases on this technique have been successfully applied in a search engine for the ".it" domain but can be applied directly to other domains. We report statistics on the results of the use of our tools in this context.

World Wide Web and Robots

Web robots are tools used to gather data from the Web [De Bra 1996]. Robot behaviour may be formally described by means of an oriented graph G = (N, A), where N is a finite set of nodes, corresponding to Web documents uniquely specified through their URL, and A is the set of arcs which represent unidirectional pointers, corresponding to the links between documents. G does not have to be fully interconnected since there can be URLs not reachable from other URLs: on Internet there can be a Web server completely autonomous defining a subgraph without any incoming arcs. A more appropriate model for the highly irregular World Wide Web is hence a set of connected components.

The task of a robot is to visit each node in G once avoiding to traverse twice the same path. This is done by building a sequence of graphs each one representing a more precise approximation to the whole graph G.

The two main visit techniques are: the Depth-First-Search (DFS) and the Breadth-First-Search (BFS). With the first method, when a node is visited, we go away from it as much as we can until a blind alley is found (a node v all of whose adjacent nodes have already been visited). The order of visit is FIFO and the algorithm may, as a consequence, activate a recursive process whose depth is related to the length of the most extended not-cyclic path present on the Web. The second method, instead, visits the nodes according to the growing distance d from the start node r, where the distance between r and the generic node v is the length of the shortest path from r to v. The order of visit followed is, in this case, LIFO.

Often the two techniques are combined: a set of starting points for the search is built by performing a BFS with maximum distance from a root, and then DFS is applied from this set. All the other methods are variants of combinations of BFS and DFS.

Note that the whole graph G is not known a priori. New nodes are added each time an already acquired arc is followed, building a new graph G_i from the given G_i_-1.

An analysis of the above algorithms leads us to reveal some limits:

Paths Memory: to guarantee the correctness of the visit algorithm, an auxiliary structure must be used (generally a database) to hold every already visited node (every already reached URL). The maintenance of this structure is particularly time consuming because the quantity of documents actually available on the Web is considered to be approximately 40 millions pages.
Low scalability: the Web indexing process is well suited to be executed in a parallel way by mean of a prefixed number R of robots that follow the network on different parts of the graph G used as a model. In De Bra [DeBra 1996] calculates that, given a starting base of twenty millions Web documents (containing about a thousand gigabytes of text) and with an average transfer rate of 3 Kbyte/s, the time that must be used to fetch all the information with a single robot is equal to about 8 months. The same task, realized trough ten distinct robots, may be accomplished in about 20 days. By adopting a parallel indexing architecture, a new class of problems is introduced. First at all, because of the fact that the structure isn't known a priori, the structure in which the URL are stored must be distributed among the robots. Every visited document is stored with a write disk access; every time a document is fetched a check must be performed, by using a read disk access, to be sure that this hasn't already been done by another robot. More robots are present, more probable is the probability of use conflicts.
Balance Absence: in order to achieve good performance levels, an architecture based on a high robot parallelization level must use a balance policy for assigning work.
Root identification: since G is not fully connected, a full search requires selecting at least one root node within each connected component. Connectivity within each component is low, as demonstrated by statistical survey performed within the RBSE project [Eichmann 1994] showing that 59% of Web documents have only one link and 96% less then five links. Therefore the choice of the root within each component it is quite critical.

We present some tools we developed for enumerating hosts and facilitating Web search. These tools are used in ARIANNA[], an Internet search engine for Web sites with italian language content. Since some hosts with italian language content do not appear under the ".it" DNS domain, several heuristics, whose discussion is beyond the scope of this paper, have been employed in order to find them.

Direct WWW Enumeration

The method we implemented exploits the fact that many problems that affect the Web visit algorithms, may be solved by building a priori an URLs list that must be the widest possible. Each item is used as a starting point for a distinct robot. Similar techniques have already been proposed and a list of them is presented in [DeBra 1996]. However, most of these are ad hoc and not general enough. For example one suggestion is to survey Usenet News postings in order to discover URLs. An important aspect of the method we propose is to be parametric: different selection criteria give rise to a spectrum of solutions. A very selective criterion, even though it reduces the indexing time, could discard valid hosts; a weak selective criterion, on the other hand, could show the opposite behaviour.

A hierarchical selection criterion based on DNS name

The WWW structure is inherently chaotic. The final aim of BFS and DFS algorithms is to build a spanning tree for the connected components of the Web. A spanning tree suggests the idea to find a way of hierarchical ordering net hosts. Luckily, this ordering can be derived from the domain subdivision provided by the DNS system [Albiz & Liu 1996]. Using tools like host or nslookup and directly querying authoritative nameservers one can find the set of host names H(D_i) = {h_d1, …, h_di} inside the domain D_i. Let D = {D₁, …, D_z} be the set containing all Internet domains and let H be set of all Internet hosts defined as follows:

Let n = #(H), R robots (with R << n) can be used to examine the n found hosts in parallel using DFS or BFS techniques (or their variants). Since not all Internet hosts have a WWW server, assigning to a robot a machine that contains nothing to be indexed is a waste of time and could invalidate any balancing policy. The compromise we adopted is to preselect all Internet hosts having a canonical name (record A) or an alias (CNAME) among those most commonly used for machines running WWW services. In detail, we define a criterion K according to which a host becomes candidate to be indexed: for instance we can test whether its name contains "www", "web", "w3" as substrings. Let W be the set of all hosts satisfying K:

Let Q = (P₁, … , P_R) be a partition of W: the indexing domain DI(r), assigned to a robot (with 1 r R) is defined as .

Our method has the following properties:

path memory of robot r is limited by the number of Web documents contained in Web servers indexed by r.
good scalability with respect to the number of used robots. Since P₁, … , P_R are disjoint there is no need to keep shared structures to record the paths already followed by each robot.
the choice of different substrings (new definition of K criterion) allows us to modify the cardinality of W.

A further optimization allows avoiding direct query to an authoritative nameserver for each domain in order to build the set H. To reach this aim the RIPE[] monthly survey (hostcount) is exploited. This has the advantage of saving bandwidth: one avoids building the Italian domain topology since it is already available on Internet. [Tab. 1] reports the results obtained applying criterion K on the hostcount for the italian domain. However, this improvement has introduced two problems. Let H_RIPE be the host used by RIPE to survey the net and let NS(D) = {NSP(D), NSS₁(D)₎ …, NSS_j(D)} be the set formed by the primary nameserver and by j secondary nameservers for domain D. It can happen that:

a query made from H_RIPE to all nameservers in NS(D) is not answered, for instance because of a time-out.

the policy of a domain forbids zone retrieval from H_RIPE.

If one of these conditions occurs during analysis of domain D_i, hostcount omits from the survey both D_i and all domains delegated by the authoritative nameservers for D_i. Note that (1) and (2) might not happen if the queries to NS(D) are performed from a host different from H_RIPE. In such cases, one could discover the domain structure even if it is not present in hostcount. Statistical studies (reported in [Tab. 2] ) justify this corrective intervention: in fact for domain ".it" our method counts a number of machines that is more than 10% greater than that contained in the RIPE database.

Once the set of the hosts in W has been built, in order to determine those which are actual Web servers, we have developed testwww, an agent which tests the presence of a HTTP server on a set of TCP ports chosen among those generally used to provide such service. testwww was tested on more than 14000 Italian hosts [Tab. 3]. Let T be the set built from W using testwww:

We then partition T into R subsets (P₁, … , P_R) as above and assign a robot r to each P_i= DI(r). During the analysis of host h DI(r), the path memory for robot r is limited to the documents contained in h. This provides scalability for performing parallel indexing in a search engine.

Besides, testwww allows us to build statistics on [Tab. 4]:

the choice of ports used for the WWW service;
the kind of Web server implementation;
other services offered beyond WWW (such as Proxy-Server).

Document Relocation

Some HTTPD daemons provide several virtual Web servers on the same IP address. This is known as VIRTUAL HOSTING: the administration job is centralized while the content of each virtual site is under the responsibility of its respective owner. A single IP address is used but each virtual site has assigned a CNAME (or an A record on the same IP). Virtual Hosting was standardized in version 1.1 of HTTP protocol [Berners-Lee, Fielding & Frystyk 1996]. This requires a revision of robot's visiting algorithms. Indeed, if path memory is managed only according already visited IP addresses, documents with the same document path but located on distinct virtual servers will be missed.

We solved this problem using a heuristic: the testwww agent, given a set of A and CNAME records N(host) = { A₁, … , A_p , C₁, … , C_q } associated via DNS to net address I_host, contacts (p + q) times the HTTP server on I_host asking for the Document Root, with a different argument for the "Host:" directive chosen from N(host). For each returned page, testwww computes a MD5 RSA [Schneier 1996] signature. Two Web servers are considered distinct if and only if they have different associated signatures. This is well represented by the set M:

Robots' Load Balancing

To perform the partitioning of hosts into indexing domains we decided to exploit the division into Autonomous Systems (AS). The idea is to minimize the number of border gateways (according to BGP4 terminology [Huitema 1995]) crossed during the indexing phase in order to reduce the probability of using links with high traffic load.

For the case of the italian domain we chose three indexing hosts, each one in a different AS. The prtraceroute tool [PRIDE 1996] tells us to for each host, both which AS it belongs to and how many AS are crossed to reach it. We then filter static information given by the RIPE Registration Authority (via whois) in order to assign hosts to the indexing points so that the number of crossed border gateways is minimized and therefore the indexing time is reduced.

Open Research Areas

In summary our method has the following features:

reduced indexing time since each robot has a preassigned task to be accomplished;
reduced use of resources because there is no need to maintain shared structures;
potential for parallelism in terms of robots that can be used;
possibility of assigning work according to a policy of load balancing.

As it may be expected there also are some limits: the most evident one is that we miss Web servers not satisfying the K criterion. Several approaches are possible to mitigate this drawback although we expect it not to be significant. Statistical data (reported in [Tab. 5]) show that we collect a remarkable number of URLs compared to similar engines based on traditional spider mechanisms: ARIANNA is currently the most complete Italian search engine in terms of indexed information.

A first method to discover a host not satisfying the criterion K is based on a different way of integrating DNS enumeration techniques with visit algorithms. Once an indexing domain DI(r) = { H₁, … , H_l } has been built, each robot is allowed to cross not only the host H_i (with i < l) but also all hosts in the same DNS domain. In this case the path memory of each robot is now limited to WWW pages contained in the Web servers of the examined DNS domain. Careful partition that doesn't assign to different robots the same DNS domain allows us to reach scalability for parallel indexing. Alternatively the method may use as a boundary the IP classes.

A second method is based on post processing the already collected Web pages when the indexing phase is ended. This is done in order to discover news URLs not enumerated in the set W. Note that this is a local job (not involving communication) and as consequence less time-consuming.

Experimental Results

We report statistical data which refer to the use of our tools within the ARIANNA search engine.

Criterion applied on hostcount Month Res. Month Res.

Number of tested domain Dec 96 6995 Feb 97 8608

WWW servers discovered Dec 96 6932 Feb 97 8573

Tab 1: Criterion applied on hostcount

hostcount Problems	Month	Res .	Month	Res.
Unanswered queries	Dec 96	646	Feb 97	464
AXFR negations	Dec 96	248	Feb 97	364
Wrong answers from NS	Dec 96	484	Feb 97	401

Tab 2: hostcount Problems

Direct DNS query	Month	Res.	Month	Res.
Tested domains	Dec 96	1231	Feb 97	738
WWW servers discovered	Dec 96	513	Feb 97	134

Tab 3: Direct DNS query

testwww	Month	Res.	Month	Res.
Examined domains	Dec 96	8228	Jul 97	14431
Examined hosts	Dec 96	7440	Jul 97	13780
Hosts with an active Web server	Dec 96	6741	Jul 97	11045
Proxy servers discovered	Dec 96	424	Jul 97	n.a.

Tab 4: testwww Results

ARIANNA	Month	Res.	Month	Res.
Total number of indexed sites	Dec 96	6.629	Jul 97	8.134
Total number of reached URLs	Dec 96	1.351.442	Jul 97	2.095.318
Robots' disk space used (KB)	Dec 96	10.151.481	Jul 97	7.558.919
Information Retrieval's disk space (KB)	Dec 96	6.932.135	Jul 97	9.879.552
Max object's num for site	Dec 96	17.187	Jul 97	127.173
Average time for site	Dec 96	2 hours	Jul 97	2 hours
Total time ARIANNA	Dec 96	15 days	Jul 97	20 days

Tab 5: ARIANNA Results (thanks to OTM Interactive Telemedia Labs for these data)

Acknowledgements

The work described here was done within the cooperation between SerRA, Università di Pisa and OTM Interactive Telemedia Labs. We would like to thank A. Converti, L. Madella, A. Vaccari and G. Antichi of the OTM labs for their support.

References

[Albiz and Liu 1996] Albitz, P., & Liu, C. (1996). DNS and BIND 2nd Edition. O'Reilly & Associates.

[PRIDE 1996] PRIDE - Policy based Routing Implementation Deployment in Europe, ftp://ftp.ripe.net/pride/

[DeBra 1996] De Bra, P.M.E (1996). Introduction to WWW Robot Technology, 5th International World Wide Web Conference. O'Reilly & Associates.

[Eichmann 1994] Eichmann, D. (1994). The RBSE Spider-Balancing Effective Search Against Web Load, First WWW Conference.

[Berners-Lee, Fielding and Frystyk 1996] Berners-Lee T., Fielding R., Frystyk H. (1996). Hypertext Transfer Protocol - HTTP/1.1 - RFC 2068.

[Huitema 1995] Hiutema, C. (1995). Routing In The Internet. Prentice Hall.

[Schneier 1996] Schneier, B. (1996). Applied Cryptography, John Wiley, 2nd edition.

Criterion applied on hostcount	Month	Res.	Month	Res.
Number of tested domain	Dec 96	6995	Feb 97	8608
WWW servers discovered	Dec 96	6932	Feb 97	8573