robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the Robots Exclusion Standard, which specifies how to inform participating crawlers about the access permissions for certain parts of a website.
Structure and Syntax
The robots.txt file resides at the root of a website (e.g., https://www.example.com/robots.txt) and follows a specific syntax:
User-agent: Specifies the robot or group of robots to which the rules apply. For example:
User-agent: *
applies rules to all robots.
User-agent: Googlebot
applies rules specifically to Google's crawler.
Disallow: Specifies the URLs that are not to be crawled. For example:
Disallow: /private/
disallows crawling of all URLs under the /private/ directory.
Disallow: /cgi-bin/
disallows crawling of all URLs in the /cgi-bin/ directory.
Allow: Specifies exceptions to a Disallow rule for a given user-agent, permitting crawling of specific URLs inside an otherwise blocked directory. For example:
Allow: /public/page.html
allows crawling of a specific page even if it's in a disallowed directory.
Crawl-delay: Specifies the delay (in seconds) that robots should wait between requests to the site. For example:
Crawl-delay: 10
suggests a 10-second delay between successive requests. Support varies by crawler; Googlebot, for example, ignores this directive.
Sitemap: Specifies the location of the XML Sitemap(s) for the site. For example:
Sitemap: https://www.example.com/sitemap.xml
informs robots of the location of the XML Sitemap file.
Example robots.txt File
Here's an example of how a robots.txt file might look for a fictional website:
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
Allow: /public/page.html
Disallow: /public/
Crawl-delay: 10

User-agent: Googlebot
Disallow: /admin/
Allow: /public/page.html
Disallow: /public/
Explanation
- User-agent: *: Applies rules to all robots (* is a wildcard).
- Disallow: /private/: Prevents all robots from crawling URLs under the /private/ directory.
- Disallow: /cgi-bin/: Prevents all robots from crawling URLs under the /cgi-bin/ directory.
- Allow: /public/page.html and Disallow: /public/: Together, these block the /public/ directory while still permitting the single page /public/page.html. The Allow line comes first so that parsers which apply the first matching rule also honor the exception.
- Crawl-delay: 10: Suggests a 10-second delay between requests to the site for all robots.
- User-agent: Googlebot: Starts a group of rules that applies specifically to Google's crawler. A robot that finds a group matching its own name follows that group instead of the * group, which is why the /public/ rules are repeated here.
- Disallow: /admin/: Prevents Googlebot from crawling URLs under the /admin/ directory.
- Allow: /public/page.html: Allows Googlebot to crawl /public/page.html, overriding the group's Disallow rule for /public/.
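To see how these rules are evaluated in practice, here is a minimal sketch using Python's standard-library urllib.robotparser module against the example file above. Note that this parser applies the first matching rule for a path, which is one reason the Allow line precedes Disallow: /public/ in the example:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, parsed directly as a list of lines.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
Allow: /public/page.html
Disallow: /public/
Crawl-delay: 10

User-agent: Googlebot
Disallow: /admin/
Allow: /public/page.html
Disallow: /public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Blocked for all robots by Disallow: /private/
print(rp.can_fetch("*", "https://www.example.com/private/data.html"))    # False

# Allowed: the Allow exception matches before Disallow: /public/
print(rp.can_fetch("*", "https://www.example.com/public/page.html"))     # True

# Googlebot follows its own group, which blocks /admin/
print(rp.can_fetch("Googlebot", "https://www.example.com/admin/panel"))  # False

# Crawl-delay declared for the * group (requires Python 3.6+)
print(rp.crawl_delay("*"))  # 10
```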
Usage and Considerations
- Location: Place the robots.txt file at the root of your website (e.g., https://www.example.com/robots.txt).
- Syntax: Follow the exact syntax rules to ensure robots interpret your directives correctly.
- Testing: Use tools like Google Search Console to verify that your robots.txt file behaves as intended; you can also check it programmatically, as in the sketch after this list.
- Sitemap: Include a Sitemap directive to help search engines discover your XML Sitemap(s).
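To apply the Testing advice programmatically, the same module can fetch and evaluate a live file. This is only a sketch: the site URL, page URL, and crawler name below are placeholders, and site_maps() requires Python 3.8 or later.

```python
import time
from urllib.robotparser import RobotFileParser

# Placeholder values; substitute your own site and crawler name.
ROBOTS_URL = "https://www.example.com/robots.txt"
USER_AGENT = "MyCrawler"

rp = RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetches and parses the live robots.txt over the network

page = "https://www.example.com/some/page.html"
if rp.can_fetch(USER_AGENT, page):
    # Honor any Crawl-delay the site declares before issuing the request.
    delay = rp.crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)
    print(f"OK to fetch {page}")
else:
    print(f"Blocked by robots.txt: {page}")

# Sitemap directives, if any are declared (Python 3.8+); None when absent.
print(rp.site_maps())
```

Since read() performs a network fetch, parse the file once and reuse the object for all URL checks rather than re-reading it for every request.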
Robots.txt is an essential tool for managing how search engines and other bots interact with your website, helping to focus crawling and indexing where it matters. Keep in mind that it is advisory rather than a security mechanism: the file is publicly readable, and only well-behaved robots honor it, so it should not be relied on to protect sensitive content.