A: There is the Nutch project, now part of the Apache Software Foundation, which lets you build on a basic search engine by customizing what gets indexed. It uses simple config files to tell the indexer to index either everything or only a selected set of URLs.
http://lucene.apache.org/nutch/
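
To give a rough idea of the "select set of URLs" case: Nutch's regex URL filter reads rules from a plain-text file, typically conf/regex-urlfilter.txt (some older releases used conf/crawl-urlfilter.txt for the one-step crawl command). Each rule is a regular expression prefixed with + to include or - to exclude, and the first rule that matches a URL decides its fate. The sketch below is only an illustration: example.com is a placeholder host, and the exact file name and default rules may differ in your version, so compare against the file that ships with it.

```text
# Skip URLs with schemes we do not want to fetch
-^(file|ftp|mailto):

# Accept anything on example.com and its subdomains (placeholder host)
+^https?://([a-z0-9-]+\.)*example\.com/

# Reject everything else; replace this line with "+." to index everything
-.
```

Because the first matching rule wins, the order matters: keep the specific include rules above the final catch-all line.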