Ways to Snatch Defeat from the Jaws of Victory
You may have set up your repository and filled it with interesting papers, but it is still possible to screw things up technically so that search engines and harvesters cannot index your material. Here are some common gotchas:
- Require all visitors to have a username and password
- Harvesters and crawlers will be locked out, and a lot of end users
will give up and go away. It is reasonable to require a username and password
for depositing items, but not for just searching and reading.
- Do not have a 'Browse' interface with hyperlinks between pages
- Search engine crawlers will never index past your first page, because button-style
controls cannot normally be followed (see the HTML sketch after this list).
- Set a 'robots.txt' file and/or use 'robots' meta tags in HTML headers that prevent search engine
crawling
- Google, Yahoo!, etc., may find your pages, but if you tell them not
to index them or to follow the links, they won't (see the robots.txt example after this list).
- Restrict access to embargoed and/or other (selected) full texts
- Search engines and harvesters may index the metadata pages, but not
the full texts of the relevant items.
- Accept poor quality or restrictive PDF files
- Some PDF-making software packages (usually free, cheap, or esoteric) generate
poor quality PDF files that cannot always be read properly by harvesting and indexing
programs. However, you can still cause problems even with high-end software if you use
it to restrict the functionality of the PDF file - e.g. by preventing copy-and-paste.
It may not be possible to index such files (a quick extractability check is sketched after this list).
- Hide your OAI Base URL
- Harvesters can only collect your metadata via OAI-PMH if they can find your
repository's base URL, so publish it prominently and register it with the relevant
harvesting services (a simple reachability check is sketched after this list).
- Have awkward URLs
- Many harvesters and firewalls will reject or block:
- Numeric URLs - e.g. http://130.226.203.32/
- URLs that use 'https:' instead of 'http:'
- URLs that include unusual port numbers e.g. :47231
Stick to 'http:' and domain-name URLs. It should be possible to avoid using port numbers in URLs altogether.
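To make the 'Browse' point above concrete: crawlers follow ordinary hyperlinks but do not press buttons or submit forms. A minimal HTML sketch of the contrast (the page name is hypothetical):

```html
<!-- Crawlable: search engines follow plain hyperlinks -->
<a href="/browse/by-year/2006.html">Browse 2006 items</a>

<!-- Not crawlable: crawlers do not normally submit forms or press buttons -->
<form method="post" action="/browse">
  <input type="hidden" name="year" value="2006">
  <button type="submit">Browse 2006 items</button>
</form>
```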
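On the robots.txt point: a single overly broad rule is enough to hide the whole repository from crawlers. The two hypothetical files below show the difference; only the second leaves your records indexable:

```
# DON'T: this robots.txt locks every crawler out of the entire site
User-agent: *
Disallow: /

# DO (as a separate robots.txt): block only non-content pages
User-agent: *
Disallow: /admin/
```

The same applies to meta tags: <meta name="robots" content="noindex,nofollow"> in a page's HTML header tells a search engine that has found the page to ignore both it and its links.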
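For the PDF point: a quick way to test whether a deposited file's text can actually be extracted - and therefore indexed - is to try extracting it yourself. A minimal sketch using the third-party pypdf library; the filename is hypothetical:

```python
from pypdf import PdfReader

# Try to do roughly what an indexing program does: pull the text out.
reader = PdfReader("deposited-paper.pdf")

if reader.is_encrypted:
    print("File is encrypted or restricted - indexers may be unable to read it")
else:
    text = "".join(page.extract_text() or "" for page in reader.pages)
    if text.strip():
        print(f"Extracted {len(text)} characters - the file looks indexable")
    else:
        print("No extractable text - the PDF may be image-only or badly made")
```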
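And for the OAI Base URL point: every OAI-PMH endpoint must answer the 'Identify' verb, so a single request confirms that harvesters can reach yours. A sketch using only the Python standard library; the base URL is hypothetical:

```python
from urllib.request import urlopen

base_url = "http://repository.example.org/oai"  # hypothetical OAI base URL

# Every OAI-PMH endpoint must respond to ?verb=Identify;
# if this request fails, metadata harvesters cannot see you either.
with urlopen(base_url + "?verb=Identify", timeout=30) as response:
    body = response.read().decode("utf-8", errors="replace")

if "<Identify>" in body:
    print("OAI-PMH endpoint is up and answers Identify requests")
else:
    print("Endpoint reachable, but the reply is not a valid Identify response")
```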
If you know of any other ways in which things may go awry, please contact us and we will consider adding them to the list.
To help identify problem pages on your site and verify the crawling process, you may like to sign up for Google's Webmaster Tools.