
June 29, 2024

What is robots.txt? With examples for SEO


robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the Robots Exclusion Standard, which specifies how to inform participating crawlers about the access permissions for certain parts of a website.

Structure and Syntax

The robots.txt file resides at the root of a website (e.g., https://www.example.com/robots.txt) and follows a specific syntax (a short parsing sketch appears after the list):

  1. User-agent: Specifies the robot or group of robots to which the rules apply. For example:

    • User-agent: * applies rules to all robots.
    • User-agent: Googlebot applies rules specifically to Google's crawler.
  2. Disallow: Specifies the URLs that are not to be crawled. For example:

    • Disallow: /private/ disallows crawling of all URLs under the /private/ directory.
    • Disallow: /cgi-bin/ disallows crawling of all URLs in the /cgi-bin/ directory.
  3. Allow: Optionally, specifies exceptions to the disallow rule for a specific user-agent. For example:

    • Allow: /public/page.html allows crawling of a specific page even if it's in a disallowed directory.
  4. Crawl-delay: Specifies the delay (in seconds) that robots should wait between requests to the site. Not all crawlers honor it; Googlebot, notably, ignores Crawl-delay. For example:

    • Crawl-delay: 10 suggests a 10-second delay between successive requests.
  5. Sitemap: Specifies the location of the XML Sitemap(s) for the site. For example:

    • Sitemap: https://www.example.com/sitemap.xml informs robots of the location of the XML Sitemap file.
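As a concrete illustration, Python's standard-library urllib.robotparser can parse these directives and answer crawl queries. Here is a minimal sketch using an illustrative file; "MyBot" is a hypothetical crawler name and the URLs are placeholders:

from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content mirroring the directives above.
ROBOTS = """\
User-agent: *
Allow: /public/page.html
Disallow: /private/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# "MyBot" (hypothetical) falls under the "*" group.
print(rp.can_fetch("MyBot", "https://www.example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "https://www.example.com/public/page.html"))   # True
print(rp.crawl_delay("MyBot"))   # 10
print(rp.site_maps())            # ['https://www.example.com/sitemap.xml'] (Python 3.8+)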

Example robots.txt File

Here's an example of how a robots.txt file might look for a fictional website:

User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
Allow: /public/page.html
Disallow: /public/
Crawl-delay: 10

User-agent: Googlebot
Disallow: /admin/
Allow: /public/page.html

Sitemap: https://www.example.com/sitemap.xml

Explanation

  • User-agent: *: Applies rules to all robots (* is a wildcard).

  • Disallow: /private/: Prevents all robots from crawling URLs under the /private/ directory.

  • Disallow: /cgi-bin/: Prevents all robots from crawling URLs under the /cgi-bin/ directory.

  • Allow: /public/page.html: Lets all robots crawl this specific page even though /public/ is otherwise disallowed. It is listed before Disallow: /public/ because some parsers apply the first rule that matches; Google applies the most specific (longest-path) rule, so order does not matter for Googlebot.

  • Disallow: /public/: Prevents robots in this group from crawling anything else under the /public/ directory.

  • Crawl-delay: 10: Suggests a 10-second delay between requests to the site for all robots.

  • User-agent: Googlebot: Starts a group that applies only to Google's crawler. A crawler follows only the most specific group matching its user-agent, so Googlebot obeys these rules and ignores the * group entirely, as the sketch below demonstrates.

  • Disallow: /admin/: Prevents Googlebot from crawling URLs under the /admin/ directory.

  • Allow: /public/page.html: Explicitly permits Googlebot to crawl /public/page.html. Since Googlebot ignores the * group, it is not bound by Disallow: /public/ anyway; repeating the Allow keeps the exception explicit.

  • Sitemap: https://www.example.com/sitemap.xml: Points crawlers to the XML Sitemap. Sitemap lines apply site-wide rather than to a particular user-agent group.
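The group-selection behavior can be verified with the same standard-library parser. A minimal sketch, feeding it the example file above (bot names and URLs are illustrative); note that urllib.robotparser applies the first matching rule within a group, whereas Google applies the most specific one:

from urllib.robotparser import RobotFileParser

EXAMPLE = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
Allow: /public/page.html
Disallow: /public/
Crawl-delay: 10

User-agent: Googlebot
Disallow: /admin/
Allow: /public/page.html
"""

rp = RobotFileParser()
rp.parse(EXAMPLE.splitlines())

# A generic bot (hypothetical name) falls under the "*" group.
print(rp.can_fetch("MyBot", "https://www.example.com/public/other.html"))   # False
print(rp.can_fetch("MyBot", "https://www.example.com/public/page.html"))    # True

# Googlebot matches its own group and ignores the "*" group entirely.
print(rp.can_fetch("Googlebot", "https://www.example.com/admin/panel"))     # False
print(rp.can_fetch("Googlebot", "https://www.example.com/private/x.html"))  # True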

Usage and Considerations

  • Location: Place the robots.txt file at the root of your website (e.g., https://www.example.com/robots.txt).
  • Syntax: Follow the exact syntax rules to ensure robots interpret your directives correctly.
  • Testing: Use tools such as Google Search Console's robots.txt report to confirm the file is fetched and interpreted as intended; you can also test it programmatically (see the sketch after this list).
  • Sitemap: Include a Sitemap directive to help search engines discover your XML Sitemap(s).
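Beyond Search Console, a quick programmatic check is possible with Python's standard-library parser. A sketch, assuming https://www.example.com stands in for your own site and "MyBot" is a hypothetical user-agent:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder URL
rp.read()  # fetches the live file over HTTP and parses it

url = "https://www.example.com/some/page.html"
if rp.can_fetch("MyBot", url):
    print("Crawling allowed:", url)
else:
    print("Crawling blocked:", url)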

robots.txt is an essential tool for managing how search engines and other bots interact with your website, helping to focus crawling and indexing where it matters. Keep in mind that it is advisory and publicly readable: well-behaved crawlers honor it, but it does not secure sensitive content, so use authentication (or a noindex directive) for pages that must stay private.

