Introduction: This article focuses on website-building technology: what the robots.txt file is and what it does for SEO. It offers practical guidance on building, operating, and promoting websites to help small and medium-sized webmasters grow.
Robots.txt is a text file that gives instructions to search engine robots for SEO optimization.
If used correctly, it can ensure that search engine robots (also called crawlers or spiders) correctly crawl and index your website pages.
If used incorrectly, it may hurt your SEO rankings and website traffic. So how do you set up the robots.txt file correctly? Yideng will share some experience today, covering the following aspects.
Table of Contents
- 1. What is robots.txt?
- 2. What is the use of robots.txt for SEO?
- 3. How do you write robots.txt for SEO optimization?
- Summary
1. What is robots.txt?
Robots.txt is a plain text file placed in the website’s root directory; it is not created automatically, so you need to add it yourself.
If your website’s domain name is www.abc.com, the viewing address of robots.txt is www.abc.com/robots.txt.
The robots.txt file contains search engine robot instructions.
When a search engine robot visits your website, it first checks the contents of the robots.txt file, then crawls and indexes your pages according to those instructions, including some pages and excluding others.
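To see how a well-behaved crawler applies these instructions, here is a minimal sketch using Python's built-in urllib.robotparser; the www.abc.com domain and the two test URLs are only placeholders:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = RobotFileParser()
rp.set_url("https://www.abc.com/robots.txt")
rp.read()

# Ask whether a given user-agent may fetch a given URL.
print(rp.can_fetch("*", "https://www.abc.com/wp-admin/"))     # False if the path is disallowed
print(rp.can_fetch("*", "https://www.abc.com/blog/post-1/"))  # True if no rule blocks it
```

A crawler performs essentially this check before requesting each URL.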
It should be noted that the robots.txt file is not a mandatory setting. Whether to create one, why, and what it is used for are explained in detail below.
2. What is the use of robots.txt for SEO?
Simply put, robots.txt has two functions: allowing and blocking search engine robots from crawling pages on your website. Without it, search engine robots will crawl the entire website, including all the data and content under the website root directory.
For the specific working principle, refer to elliance's illustration of how search engine crawlers work.
In 1993, the Internet was just getting started and very few websites could be discovered, so Matthew Gray wrote a crawler, the World Wide Web Wanderer, to find new websites and compile them into a directory.
But later crawlers did more than compile directories; they also crawled and downloaded large amounts of website data.
In July of the same year, the website data of Martijn Koster, the creator of Aliweb, was crawled maliciously, so he proposed the robots protocol.
Its purpose is to tell crawlers which web pages may be crawled and which may not, especially data pages the site owner does not want exposed. After a series of discussions, robots.txt officially entered the picture.
From an SEO perspective, a newly launched website has few pages, so whether it uses robots.txt hardly matters. As the number of pages grows, however, the SEO benefits of robots.txt become apparent, mainly in the following ways.
- Optimize crawling of search engine robots
- Prevent malicious crawling and optimize server resources
- Reduce duplicate content appearing in search results
- Keep pages and links you want hidden from appearing in search results
3. How do you write robots.txt for SEO optimization?
First, there is no fixed template for a robots.txt file; it is built from a few directives: User-agent, Disallow, Allow, and Crawl-delay (a combined example follows the list below).
- User-agent: the search engine crawler the rules apply to; * means all crawlers
- Disallow: the website content and folders you want to block from crawling, written as paths starting with /
- Allow: the website content, folders, and links you allow to be crawled, written as paths starting with /
- Crawl-delay: a number specifying a delay between requests; not recommended for small websites, and note that Googlebot ignores this directive
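For instance, a minimal sketch that combines all four directives might look like this (the /private/ paths and the delay value are placeholders to adapt to your own site):

User-agent: *
Crawl-delay: 10
Disallow: /private/
Allow: /private/allowed-page.html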
For example, if you want to prevent Googlebot from crawling your website's category pages, write it as follows:
User-agent: Googlebot
Disallow: /category/
For example, if you want to block all search engines from crawling the WordPress admin and login links, write it as follows:
User-agent: *
Disallow: /wp-admin/
For example, if you want to allow only Google Images to crawl your WordPress website's images, write it as follows:
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
For more specific rules, refer to the official Google robots.txt documentation, summarized in the table below.
Rule | Example |
---|---|
Disallow the entire website | User-agent: *<br>Disallow: / |
Disallow a specific directory or file | User-agent: *<br>Disallow: /calendar/<br>Disallow: /junk/ |
Allow only a specific crawler | User-agent: Googlebot-news<br>Allow: / |
Allow all crawlers except one | User-agent: Unnecessarybot<br>Disallow: /<br>User-agent: *<br>Allow: / |
Disallow a specific page | User-agent: *<br>Disallow: /private_file.html |
Disallow Google Images from accessing a specific image | User-agent: Googlebot-Image<br>Disallow: /images/dogs.jpg |
Disallow Google Images from accessing all images | User-agent: Googlebot-Image<br>Disallow: / |
Disallow a specific file type (e.g. .gif) | User-agent: Googlebot<br>Disallow: /*.gif$ |
Disallow the entire website, but allow AdSense | User-agent: *<br>Disallow: /<br>User-agent: Mediapartners-Google<br>Allow: / |
Block URLs ending with a specific string (e.g. .xls) | User-agent: Googlebot<br>Disallow: /*.xls$ |
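A quick note on the pattern syntax used in the table, as Google documents it: * matches any sequence of characters and $ anchors the end of the URL. For example:

User-agent: Googlebot
Disallow: /*.gif$

blocks /images/photo.gif but not /images/photo.gif?size=large, because only the first URL actually ends in .gif.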
Although these rules look complicated, they become much simpler if you use WordPress; after all, WordPress is practically Google's favorite child. As far as SEO is concerned, the best way to write robots.txt for a WordPress website is as follows, which you can do in any plain-text editor.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.yourdomain.com/sitemap.xml
Or it can be like this.
User-agent: *
Allow: /
Sitemap: https://www.yourdomain.com/sitemap.xml
The difference is whether to prohibit crawling /wp-admin/.
Regarding /wp-admin/, WordPress added a @header('X-Robots-Tag: noindex') call in 2012, which tells search engines not to index those admin pages, much the same outcome as blocking /wp-admin/ in robots.txt. If you are still worried, you can add the rule anyway.
As for other website content and links you don’t want to be crawled by search engines, do it according to the needs of your website.
You can use robots.txt to block crawling, or a Meta Robots tag to apply noindex. Meta Robots suits the links generated by the WordPress program itself, while robots.txt suits the content pages on your website that need to stay hidden.
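For reference, a minimal sketch of a Meta Robots tag; it goes inside the page's head section, and the noindex, follow value keeps the page out of search results while still letting the links on it be followed:

```html
<meta name="robots" content="noindex, follow">
```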
Summary
The next step is to add the finished robots.txt file to the root directory of your WordPress website.
In my experience, the fewer instructions in robots.txt, the better. When I was still a novice, I followed articles by supposed experts and blocked many directories and files, especially /wp-includes/, which stopped JS and CSS from loading properly.
Finally, note that the paths in robots.txt rules are case-sensitive, so be careful not to get them wrong. After reading this, you should know what the robots.txt file is used for and how it affects SEO.