Gathering Pages
- An indexing program called a spider visits a page on a web
site, usually one submitted by the site's owner and often the
site's root page.
- The spider then follows the links on that page to other pages
on the site, up to a set link depth.
| Page Name | Page Contents |
| Syllabus.htm | Important words from <a href="Ch1.htm">Chapter 1</a>. Important words from <a href="Ch2.htm">Chapter 2</a>. Important words from <a href="Ch3.htm">Chapter 3</a>. |
| Ch1.htm | Important words from <a href="HW1.htm">Homework 1</a>. |
| HW1.htm | Important words from Homework 1. Design of a website. |
| Ch2.htm | Important words from <a href="HW2.htm">Homework 2</a>. |
| HW2.htm | Important words from Homework 2. Implementation of a website. |
| Ch3.htm | Important words from <a href="HW3.htm">Homework 3</a>. |
| HW3.htm | Important words from Homework 3. |
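
The depth-limited visit described above can be sketched in Python. This is a minimal illustration, not how any particular search engine's spider works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the sample site (a real spider would fetch pages over HTTP), the hrefs are written to match the page names exactly, and the names `LinkParser`, `extract_links`, and `crawl` are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory copy of the sample site; a real spider fetches over HTTP.
PAGES = {
    "Syllabus.htm": 'Important words from <a href="Ch1.htm">Chapter 1</a>. '
                    'Important words from <a href="Ch2.htm">Chapter 2</a>. '
                    'Important words from <a href="Ch3.htm">Chapter 3</a>.',
    "Ch1.htm": 'Important words from <a href="HW1.htm">Homework 1</a>.',
    "HW1.htm": "Important words from Homework 1. Design of a website.",
    "Ch2.htm": 'Important words from <a href="HW2.htm">Homework 2</a>.',
    "HW2.htm": "Important words from Homework 2. Implementation of a website.",
    "Ch3.htm": 'Important words from <a href="HW3.htm">Homework 3</a>.',
    "HW3.htm": "Important words from Homework 3.",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in an HTML fragment, in document order."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def crawl(start, max_depth, pages=PAGES):
    """Visit pages breadth-first from `start`, following links up to `max_depth`."""
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        name, depth = queue.popleft()
        if name not in pages:
            continue  # broken link: nothing to visit
        visited.append(name)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in extract_links(pages[name]):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("Syllabus.htm", 1))
print(crawl("Syllabus.htm", 2))
```

With a depth of 1 the sketch reaches only the syllabus and the three chapter pages; a depth of 2 also reaches the homework pages.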
- Important words on each page are indexed into a search
database for later reference during a search. Important words are
the words in the title and unusual words that may distinguish this
page from others.
- Noise words, such as "the", "a", and "and", occur so commonly
that they are useless in distinguishing one page from another, and
are typically left out of the index.
- Each word in the search database points back to the pages
where the word was found, so those pages can be retrieved later.
The inverted index of the database for the above pages would include:
| Word | Links |
| design | HW1.htm |
| implementation | HW2.htm |
| website | HW1.htm HW2.htm |
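
One way such an inverted index might be built is sketched below in Python. It is an illustration under stated assumptions, not a definitive indexer: the page texts are copied from the sample site, the noise-word list starts with the three examples given above ("the", "a", "and") and adds "of" and "from" as assumed extras, and the function names are invented for this example. Unlike the table, which lists only a few entries, the sketch indexes every word that is not a noise word.

```python
import re

# Assumed noise-word list: "the", "a", "and" come from the text; the rest are guesses.
NOISE_WORDS = {"the", "a", "and", "of", "from"}

# A few pages from the sample site, used as hypothetical input.
SAMPLE_PAGES = {
    "Ch1.htm": 'Important words from <a href="HW1.htm">Homework 1</a>.',
    "HW1.htm": "Important words from Homework 1. Design of a website.",
    "HW2.htm": "Important words from Homework 2. Implementation of a website.",
    "HW3.htm": "Important words from Homework 3.",
}


def tokenize(html):
    """Strip tags, lowercase the text, and split it into words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word that is not a noise word to the sorted pages containing it."""
    index = {}
    for name, html in pages.items():
        for word in tokenize(html):
            if word not in NOISE_WORDS:
                index.setdefault(word, set()).add(name)
    return {word: sorted(names) for word, names in index.items()}


index = build_inverted_index(SAMPLE_PAGES)
for word in ("design", "implementation", "website"):
    print(word, index[word])
```

The three printed entries match the rows of the table above.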
- Some search engines use the number of links pointing to a
page from other pages to determine page rank (e.g. Google.com).
The more in-links a page has, the higher its presumed popularity
or importance. Some engines (e.g. Teoma.com) assume that
inter-linked pages form an authoritative group and give them a
higher rank than pages outside the group.
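
The in-link idea can be illustrated with a plain count, sketched below in Python. This is deliberately the simplest possible measure and should not be read as the algorithm any named search engine actually uses. The link graph follows the sample site, with one hypothetical extra link from HW2.htm to HW1.htm added so that the counts differ; `count_inlinks` and `rank_by_inlinks` are names invented for this example.

```python
# Link graph of the sample site: page -> pages it links to.
# The HW2.htm -> HW1.htm link is hypothetical, added only to make the counts differ.
LINKS = {
    "Syllabus.htm": ["Ch1.htm", "Ch2.htm", "Ch3.htm"],
    "Ch1.htm": ["HW1.htm"],
    "Ch2.htm": ["HW2.htm"],
    "Ch3.htm": ["HW3.htm"],
    "HW1.htm": [],
    "HW2.htm": ["HW1.htm"],
    "HW3.htm": [],
}


def count_inlinks(links):
    """Count how many links point to each page from other pages."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            if target != source:  # a page linking to itself is not counted
                counts[target] = counts.get(target, 0) + 1
    return counts


def rank_by_inlinks(links):
    """Order pages by descending in-link count, breaking ties by name."""
    counts = count_inlinks(links)
    return sorted(counts, key=lambda page: (-counts[page], page))


counts = count_inlinks(LINKS)
ranked = rank_by_inlinks(LINKS)
print(counts)
print(ranked)
```

Under this measure HW1.htm ranks first because two pages link to it, and Syllabus.htm ranks last because nothing in the sample links to it.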

Search
- A query is the set of words that a search engine such as
Google is asked to locate.
- The search engine matches the query words against the important
words indexed from pages.
- The result returned is the list of pages where the query words
were found.
- Ranking may be based on page popularity, the number of words
matched, or another measure.
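
The search steps above can be sketched in Python against a small inverted index. This is a minimal illustration assuming ranking by the number of query words matched, which is only one of the measures mentioned; the `INDEX` dictionary holds the three sample entries (design, implementation, website), and `search` is a name invented for this example.

```python
# Inverted index holding the three sample entries: word -> pages containing it.
INDEX = {
    "design": ["HW1.htm"],
    "implementation": ["HW2.htm"],
    "website": ["HW1.htm", "HW2.htm"],
}


def search(query, index=INDEX):
    """Return pages containing any query word, most matched words first."""
    matches = {}
    # A set avoids counting a repeated query word twice.
    for word in set(query.lower().split()):
        for page in index.get(word, []):
            matches[page] = matches.get(page, 0) + 1
    # Rank by number of query words matched, breaking ties by page name.
    return sorted(matches, key=lambda page: (-matches[page], page))


print(search("website design"))
print(search("implementation of a website"))
print(search("syllabus"))
```

Words that are not in the index, such as the noise words "of" and "a", simply match nothing, and a query with no indexed words returns an empty list.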