A346 - Search



 

How Search Engines Work

 

Gathering Pages

  • An indexing program called a spider visits a page on a web site, usually one submitted by the site owner or the site's default homepage.

  • From that page, the spider follows the links it finds to other pages on the site, up to a certain link depth.
Page Name      Page Contents
Syllabus.htm   Important words from <a href="Ch1.htm">Chapter 1</a>.
               Important words from <a href="Ch2.htm">Chapter 2</a>
               Important words from <a href="Ch3.htm">Chapter 3</a>
Ch1.htm        Important words from <a href="HW1.htm">Homework 1</a>
HW1.htm        Important words from Homework 1. Design of a website.
Ch2.htm        Important words from <a href="HW2.htm">Homework 2</a>
HW2.htm        Important words from Homework 2. Implementation of a website.
Ch3.htm        Important words from <a href="HW3.htm">Homework 3</a>
HW3.htm        Important words from Homework 3.
  • Important words on each page are indexed into a search database for later reference during a search. Important words are those in the title and unusual words that may distinguish this page from others.
  • Noise words, such as the, a, and and, occur so commonly that they are useless in distinguishing one page from another.
  • Each word in the search database points back to the pages where the word was found, used for later retrieval. The database for the above pages would include:
Word             Links
design           HW1.htm
implementation   HW2.htm
website          HW1.htm HW2.htm
  • Some search engines use the number of links pointing to a page from other pages to determine page rank (e.g. Google.com). The more in-links a page has, the higher its presumed popularity or importance. Other engines assume that inter-linked pages form an authoritative group and rank them higher than pages outside the group (e.g. Teoma.com).
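
The gathering and indexing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular engine works: the pages are held in a dictionary instead of being fetched, the noise-word list is made up, and the helper names crawl and build_index are hypothetical.

```python
import re
from collections import defaultdict

# The example pages from the table above, held in memory rather than fetched.
PAGES = {
    "Syllabus.htm": 'Important words from <a href="Ch1.htm">Chapter 1</a>. '
                    'Important words from <a href="Ch2.htm">Chapter 2</a> '
                    'Important words from <a href="Ch3.htm">Chapter 3</a>',
    "Ch1.htm": 'Important words from <a href="HW1.htm">Homework 1</a>',
    "HW1.htm": "Important words from Homework 1. Design of a website.",
    "Ch2.htm": 'Important words from <a href="HW2.htm">Homework 2</a>',
    "HW2.htm": "Important words from Homework 2. Implementation of a website.",
    "Ch3.htm": 'Important words from <a href="HW3.htm">Homework 3</a>',
    "HW3.htm": "Important words from Homework 3.",
}

# Assumed noise-word list; a real engine would use a much longer one.
NOISE_WORDS = {"the", "a", "and", "of", "from"}


def crawl(start, max_depth):
    """Return the pages reachable from start within max_depth links."""
    visited = []
    frontier = [(start, 0)]
    while frontier:
        name, depth = frontier.pop(0)
        if name in visited or name not in PAGES:
            continue
        visited.append(name)
        if depth < max_depth:
            for link in re.findall(r'href="([^"]+)"', PAGES[name]):
                frontier.append((link, depth + 1))
    return visited


def build_index(names):
    """Map each non-noise word to the set of pages containing it."""
    index = defaultdict(set)
    for name in names:
        text = re.sub(r"<[^>]+>", " ", PAGES[name])  # drop the tags, keep the text
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in NOISE_WORDS:
                index[word].add(name)
    return index


index = build_index(crawl("Syllabus.htm", max_depth=2))
print(sorted(index["website"]))  # ['HW1.htm', 'HW2.htm']
```

With max_depth=1 the spider stops at the chapter pages, so the homework pages and the word website never reach the database. The sketch also indexes every non-noise word (homework, chapter, and so on), whereas the table above lists only a few entries.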

 

 

Search

  • A query is the set of words for a search engine such as Google to locate. 

  • The search engine matches query words with important words indexed from pages.
  • The result returned is the list of pages where the query words were found.
  • Ranking may be based on page popularity, the number of query words matched, or some other measure.
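
As a rough sketch of the matching and ranking steps, the following Python looks up each query word in a small search database and orders the pages by how many query words they matched. The INDEX contents mirror the Word/Links example from the gathering section; the function name search and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

# A tiny search database in the same shape as the Word/Links example.
INDEX = {
    "design": {"HW1.htm"},
    "implementation": {"HW2.htm"},
    "website": {"HW1.htm", "HW2.htm"},
}


def search(query, index=INDEX):
    """Return pages matching any query word, best match first."""
    matches = Counter()
    for word in query.lower().split():
        for page in index.get(word, ()):
            matches[page] += 1
    # Rank by number of query words matched, then by name for a stable order.
    return sorted(matches, key=lambda page: (-matches[page], page))


print(search("website design"))  # ['HW1.htm', 'HW2.htm']
```

HW1.htm comes first because it matches both query words, while HW2.htm matches only one. An engine that ranks by popularity would replace the sort key with an in-link count or a similar measure.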

 

Search Engine Promotion

Sites pay to be ranked highly by search engines.

 

Site Submission

Most Web sites are found through search engines. If the search engines don't know about your site, no one else likely will either.

  • Part of the cost of site promotion may be buying a listing or a position on a search engine. To see what positioning does, try searching for computers on Google.
  • Your site needs to be re-spidered after every major change.
  • Most search engines accept submissions of URLs for your site. Visit the Lycos site for an example.
  • Since manual submissions would take a long time and need to be redone often, service companies will submit your site to dozens of engines for a fee.
 

Robots

  • Robots (spiders) can place a heavy load on your server as they visit pages, and they may index files that should not be indexed.
  • The robots.txt file contains requests that limit robot access to only certain parts of your site. For example, you may not want form pages that collect credit card information to be indexed.
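
To make the idea concrete, here is a small sketch that uses Python's standard urllib.robotparser module to check URLs against a robots.txt file the way a well-behaved spider might. The file contents, the /forms/ and /cgi-bin/ paths, the example.com host, and the agent name ExampleSpider are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to stay out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forms/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleSpider"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/Syllabus.htm"))       # True
print(allowed("http://example.com/forms/payment.htm"))  # False
```

The file only states requests. A spider that chooses to ignore it can still fetch the pages, so robots.txt is no substitute for real access control on sensitive forms.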

 

Optimizing Search 

 

 

What Spiders Index

Spiders pay more attention to some parts of your pages than others. Specifically:

  • <Title>Search Tags</Title> The words inside the title are considered very important by search engines to define what the page is about. 
  • <Meta Name="Description" Content="A description of the page is often used by the search engine for indexing and returned to the searcher with a link to the page">
  • <Meta Name="Keywords" Content="indexing tags meta-tag"> Keywords should generally be highly recognizable words related to the topic, alternative spellings, etc. Keywords should be useful, but they have been so abused by promoters that most search engines ignore them.
  • <Body>About the first 200 or so words of the body.</Body>
  • <H1> Heading words are often ranked by the level number. </H1>
The query:
important search engine


would produce search results from the above tags. For engines that return the text surrounding the query terms:

1. Search Tags
words inside the title are considered very important by search engines to define
http://homepages.ius.edu/rwisman/A346/html/wd7.htm

or for those that use the Description meta tag:
 
1. Search Tags
A description of the page is often used by the search engine for indexing and returned to the searcher with a link to the page.
http://homepages.ius.edu/rwisman/A346/html/wd7.htm
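
The following sketch shows one way a spider might pull out the parts of a page listed above, using Python's standard html.parser module. The class name SpiderView, the helper spider_view, the sample PAGE, and the hard 200-word limit are assumptions for illustration. Real spiders weigh these fields in ways the notes do not specify.

```python
from html.parser import HTMLParser

BODY_WORD_LIMIT = 200  # "about the first 200 or so words", per the notes


class SpiderView(HTMLParser):
    """Collect the title, meta description/keywords, H1 and body text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "keywords": "",
                       "h1": "", "body": ""}
        self._open = []  # stack of currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.fields[name] = attrs.get("content") or ""
        elif tag in ("title", "h1", "body"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._open:
            key = self._open[-1]  # text belongs to the innermost open tag
            self.fields[key] = (self.fields[key] + " " + text).strip()


def spider_view(html):
    """Return the fields a spider would keep, with the body truncated."""
    parser = SpiderView()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields["body"] = " ".join(fields["body"].split()[:BODY_WORD_LIMIT])
    return fields


PAGE = """<Head>
<Title>Search Tags</Title>
<Meta Name="Description" Content="A description of the page">
<Meta Name="Keywords" Content="indexing tags meta-tag">
</Head>
<Body>
<H1>Heading words</H1>
About the first 200 or so words of the body.
</Body>"""

for name, value in spider_view(PAGE).items():
    print(name, "->", value)
```

Heading text is stored separately from the body here so that an engine could rank it differently, as the list above suggests.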

 
 

What Spiders Often Ignore

  • Noise words and numbers. The words of "The 50 computer systems are all OK." would all be completely ignored.
  • Graphics. Spiders see only text, although they may index the caption or alternative text. Does it make more sense to spend big money on graphics or on content?

Mesh Web architecture

<img border="0" src="wd4_4.jpg" width="401" height="147" alt="Mesh Web architecture">

  • Frames. A frameset page has only links to other pages and no content itself, so a spider must understand frames to follow the links to the pages within them. Some spiders do, some don't. The following is a frameset page with links to three other pages.
<frameset rows="10%,*">
      <frame src="Menu.htm">
      <frameset cols="25%,*">
            <frame src="Syllabus.htm">
            <frame src="WD5.htm">
      </frameset>
</frameset>
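
A spider that understands frames only needs to treat each frame's src attribute as a link to follow. The sketch below, which assumes Python's standard html.parser module and a hypothetical FrameLinks class, extracts those links from the frameset above.

```python
from html.parser import HTMLParser


class FrameLinks(HTMLParser):
    """Collect the src of every <frame> so the spider can visit it."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


FRAMESET = """<frameset rows="10%,*">
      <frame src="Menu.htm">
      <frameset cols="25%,*">
            <frame src="Syllabus.htm">
            <frame src="WD5.htm">
      </frameset>
</frameset>"""

finder = FrameLinks()
finder.feed(FRAMESET)
print(finder.sources)  # ['Menu.htm', 'Syllabus.htm', 'WD5.htm']
```

A spider without this step sees a page with no text and no ordinary links, which is why framed sites may go unindexed.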

 

Exercise

  1. What would the visitor see in the following page?
  2. Determine the resulting database from the words and tags indexed by a spider for the following page.
http://spidersAreUs.com/howto.htm
<Head>
<Title>Search Exercise</Title>
<Meta Name="Description" Content="An exercise on spider indexing">
<Meta Name="Keywords" Content="crawler robot">
</Head>
<Body>
<H1>Exercise</H1>
About the first 200 or so words of the body are indexed by a spider.
<img border="0" src="wd4_4.jpg" width="401" height="147" alt="Mesh Web architecture">
</Body>
  3. What would the query: spider produce?
  4. What would the query: crawler produce?

 

Problems with Search 

 

 

Entry Points

Search engines merely connect query words to a page on the Web. 

  • A visitor arriving via a search engine often will not enter through the site homepage but will land directly on whichever page matched the query words.
  • It is important to have navigation links on all pages that will take the visitor to where you want them to go.
 
 

Spamming

  • Spiders often rate the importance of a word in describing a page by the number of times the word occurs. Repeating a word over and over to inflate that count is called spamming and may get your site banned from the search engine. Example: <title>free money free money free money free money free money ... </title>
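
One simple way an engine might flag this kind of repetition is to check whether a single word makes up too large a share of the text. The sketch below is a guess at such a check, not a description of any real engine. The function name looks_like_spam and both thresholds are assumptions.

```python
import re
from collections import Counter


def looks_like_spam(text, max_share=0.3, min_words=6):
    """Flag text in which one word accounts for more than max_share of all words."""
    words = re.findall(r"[a-z]+", text.lower())
    if len(words) < min_words:
        return False  # too short to judge
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > max_share


print(looks_like_spam("free money free money free money free money free money"))  # True
print(looks_like_spam("How search engines work and what spiders index"))          # False
```

In the first title the word free is 5 of 10 words, well over the assumed 30% limit. In the second, every word appears once.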

 

Local Search

 

Why

  • Web search engines can bring visitors from the Web to your site but aren't useful once the visitor has arrived. 
  • Local search searches only your site and not the rest of the Web. 
  • Local search requires either your own search engine or a hosted one.
 

How

A number of companies offer free search engine hosting. 

  • IU offers a search service for IU pages.
    • The pages for this course include a search form:
       
      <form method='get' action='http://search2.iu.edu/search'>
        <input type='text' name='q' size='32' maxlength='255' value=''/>
        <input type='submit' name='btnG' value='Search A346'/>
        <input type='hidden' name='site' value='ius'/>
        <input type='hidden' name='client' value='indiana'/>
        <input type='hidden' name='proxystylesheet' value='indiana'/>
        <input type='hidden' name='output' value='xml_no_dtd'/>
        <input type='hidden' name='as_sitesearch' value='http://homepages.ius.edu/rwisman/A346'/>
      </form>
    • The search is restricted to directory http://homepages.ius.edu/rwisman/A346
       
  • FreeFind is a free search service.
    • The search engine runs on their server.
    • It supports categories, useful for partitioning search on a site.
    • It's free with banner ads, or you can pay for a version without them.
    • Provides weekly search reports.
    • Try searching on the course site.  


  • Hosting your own search engine requires some effort on your part.
  • You'll need search engine software; one freely available package for Unix systems is htDig.
  • Some freely available software exists, but it is generally limited to certain operating systems and functions.
  • You can re-spider your site whenever you want.
  • You have access to search results. Reading failed queries helps you fine-tune pages so they can be found.
  • You can exercise fine-grained control over indexing and search by creating synonym lists (for common misspellings), keyword lists, and noise word lists.
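
As an illustration of the last point, the sketch below applies a synonym list and a noise-word list to a query before it is matched against the index. The list contents and the function name normalize_query are made up for this example. Search packages each have their own configuration format for such lists.

```python
# Hypothetical lists a site owner might maintain for a local search engine.
NOISE_WORDS = {"the", "a", "and", "of"}
SYNONYMS = {
    "sylabus": "syllabus",   # common misspelling
    "homwork": "homework",   # common misspelling
    "hw": "homework",        # abbreviation
}


def normalize_query(query):
    """Map misspellings to preferred terms and drop noise words and repeats."""
    terms = []
    for word in query.lower().split():
        word = SYNONYMS.get(word, word)
        if word not in NOISE_WORDS and word not in terms:
            terms.append(word)
    return terms


print(normalize_query("The sylabus and HW"))  # ['syllabus', 'homework']
```

Reading the failed queries from the search reports is one practical way to decide which misspellings belong in the synonym list.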

 

Exercise

Search your Web site.

  • Create a page in FrontPage for searching your Web site.
  • Go to FreeFind to create a free account.
  • Read the email with your password and instructions to have your site spidered.
  • Create a link to the search engine from one of your pages.
  • Test that you can find important words in pages on your site.