Posted on March 29th, 2011
If you are building a public-facing web site there is a good chance that you will have a "dev" and a "production" deployment setup so your client can view and approve work at the "dev" level before it is released to the "production" level. You may need your "dev" site to be publicly accessible so the client can show it off to people in their circle without requiring you to set up allowed access (not ideal, but very "real world"). This setup doesn't really pose a problem for our content since it will be the same at both locations...but it does pose a problem for search engine optimization. We need to make sure that search engines don't discover and index (with their crawler bots) both site locations, since they would contain duplicate content and since the "dev" location is not meant for external consumption. If we use Google as an example, we will need a different robots.txt file for each target deployment. But how can we accomplish that within an MVC3 web project?
Let's identify our goal:
Allow both "dev" and "production" to support the same route to our robots.txt file (http://ourdomain.com/robots.txt) but have the contents of the file change based on target deployment without having to do any extra steps when deploying.
We will accomplish this by dynamically rendering out the robots.txt file instead of including that file in our project. The content of this file will be different based on our target deployment. We can leverage config transforms to handle a flag that identifies which content to use.
Let's start by creating an appSetting in our web.config files. If you create a new MVC3 web project, the web.config file already has two config transform files created, Web.Debug.config and Web.Release.config, based on your default Visual Studio 2010 build configurations.
If we open up the Web.config file we can see the appSettings node looks like so:
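A rough sketch of what a new MVC3 project's appSettings node typically contains (the exact keys can vary by project template):

<appSettings>
  <add key="ClientValidationEnabled" value="true" />
  <add key="UnobtrusiveJavaScriptEnabled" value="true" />
</appSettings>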
We will add a new setting called RobotsTextAccess and will use a value of either "private" or "public" to denote what type of content to use in our robots.txt file. Let's set this to "private" at this level.
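With that in place, the appSettings node would look something like this (again a sketch; the other keys are whatever your project already had):

<appSettings>
  <add key="ClientValidationEnabled" value="true" />
  <add key="UnobtrusiveJavaScriptEnabled" value="true" />
  <add key="RobotsTextAccess" value="private" />
</appSettings>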
Since we set the default value here at the Web.config file, our transforms will automatically use this unless we override them at the target build config level. With the value set to "private", we could technically leave our Web.Debug.config alone as it will use that same value (and we want the debug target to be private). However, it is a good idea to add the transform to both targets in case we want to change it at the top level in the future.
We need to add an appSettings node (inside of the configuration node) to our Web.Debug.config and Web.Release.config files as well as a child add node to do the transform.
The Web.Debug.config code will look like:
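A sketch of that transform, using the standard xdt SetAttributes/Match(key) syntax (the system.web transform section that Visual Studio generates is omitted here):

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <appSettings>
    <add key="RobotsTextAccess" value="private"
         xdt:Transform="SetAttributes" xdt:Locator="Match(key)" />
  </appSettings>
</configuration>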
The Web.Release.config will look like:
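The same transform, but with the "public" value (again a sketch with the generated sections omitted):

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <appSettings>
    <add key="RobotsTextAccess" value="public"
         xdt:Transform="SetAttributes" xdt:Locator="Match(key)" />
  </appSettings>
</configuration>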
With our settings flag in place we can start crafting the code to render the robots.txt content. We need to create a controller called CrawlerController for handling the logic. We will include an action method called RobotsTextFile that will render our robots.txt file. This method will check our web.config app setting flag and, for the time being, will write the robots.txt content directly to the Response stream.
UPDATE
I changed the allow logic to correctly use an empty Disallow statement, as pointed out by rtpHarry in the comments below.
using System.Configuration;
using System.Web.Mvc;

namespace Website.Controllers
{
    public class CrawlerController : Controller
    {
        public ActionResult RobotsTextFile()
        {
            string content;

            // Check our config flag to decide which robots.txt content to serve
            if (ConfigurationManager.AppSettings["RobotsTextAccess"].ToLower() == "private")
            {
                // Block all crawlers from the entire site
                content = "User-agent: *\r\nDisallow: /";
            }
            else
            {
                // An empty Disallow statement allows all crawlers complete access
                content = "User-agent: *\r\nDisallow:";
            }

            // Write the content directly to the Response stream as plain text
            Response.ContentType = "text/plain";
            Response.Write(content);
            return null;
        }
    }
}
Our logic starts with creating a local variable for storing our file content string, followed by a check of the application setting flag to see if we want to render the "private" content; if not, we assume we want to render the "public" content. We then set the content of the file in a single line. After our flag check we can write the content directly to our Response stream and be done.
Next we need to add a route to our route table to support navigation to the http://ourdomain.com/robots.txt destination. In our Global.asax file we need to map a new route in the RegisterRoutes method:
public static void RegisterRoutes(RouteCollection routes)
{
    routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

    routes.MapRoute(
        "Robots.txt",
        "Robots.txt",
        new { controller = "Crawler", action = "RobotsTextFile" }
    );

    routes.MapRoute(
        "Default", // Route name
        "{controller}/{action}/{id}", // URL with parameters
        new { controller = "Home", action = "Index", id = UrlParameter.Optional } // Parameter defaults
    );
}
With the route in place we can now F5 our application and navigate directly to our robots.txt route and see our results:
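With the flag set to "private" (our default in Web.config), the response should contain the disallow-everything rules:

User-agent: *
Disallow: /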
Let's see if our "public" flag is working as well. Open up the Web.config file and change the flag value to "public", F5 again, and navigate to the robots.txt file again:
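Now the response should contain the allow-everything version with the empty Disallow statement:

User-agent: *
Disallow: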
Disco! We are in business now. But wait, our project build configuration is currently set to "Debug". How come it didn't use the Web.Debug.config setting? The web config transforms are used upon deployment or packaging of a project. When you run the site from the Visual Studio debugger it does not apply any transforms and uses the top level Web.config file. This is a prime example of why we want to include the transform setting in both of our targets. It allows us to play with our settings in our local development environment and not have to remember to set them back before we deploy.
That's it. Done. We have established a way to handle target-deployment-specific crawler access file content. From here we could refactor the way we populate the Response content. Maybe we want to create two project text files for our content to make it easier to edit and visualize (instead of doing it in code). If so, we could update our controller action logic to open the correct file based on the flag and then stream the contents directly to the Response stream, as sketched below.
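As an illustration, here is a rough sketch of what that file-based version of the action could look like; it would replace the RobotsTextFile action in the CrawlerController above, and the robots-private.txt / robots-public.txt file names under ~/Content are made up for this example:

public ActionResult RobotsTextFile()
{
    // Pick the content file based on our config flag
    // (hypothetical file names under a hypothetical ~/Content folder)
    string fileName = ConfigurationManager.AppSettings["RobotsTextAccess"].ToLower() == "private"
        ? "robots-private.txt"
        : "robots-public.txt";

    string path = Server.MapPath("~/Content/" + fileName);

    // FilePathResult streams the file contents out with the given content type
    return File(path, "text/plain");
}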
Most likely we will want to add other crawler file support (for oh, I don't know, maybe Bing!). We could add another controller action for that search engine and add another route to our route table for their desired file name. I leave it up to you to take it from here. Code on!
For a detailed description of config transforms you can check out the MSDN article at http://msdn.microsoft.com/en-us/library/dd465326.aspx. If anyone knows of some good links to posts on config transforms in general please feel free to add them to the comments.
Discussion
andrexx
Great article! To fix the IE9 navigation, right click on the toolbar and check 'Show tabs on separate row' :)
Justin Schwartzenberger
peter
You can just drag on the right side of the navigation bar to enlarge it....
Rashmi
Quite helpful. Would be of great help on my UAT and dev servers.
rtpHarry
Hey,
Just wanted to point out that the robots standard doesn't have an "allow" in the spec.
If I recall correctly Google does understand the allow keyword but the correct way to do it is to have an empty Disallow: statement.
You can read this at http://www.robotstxt.org/robotstxt.html; just scroll down to the heading "To allow all robots complete access".
Ramandeep Singh
Hi There,
This is a nice implementation.
But I have a query regarding the same, as I have tried it but not yet implemented it on the production environment. When I hit an actual robots.txt file and look at the Content-Type in the response headers using Firebug, it shows Content-Type = text/plain.
But when I serve the content dynamically through the example you provided, the response header shows Content-Type = text/plain; charset=utf-8.
My query is: will crawlers be able to crawl through it if the Content-Type = text/plain; charset=utf-8, as crawlers are rigid in this context?
Thanks, Ramandeep Singh
Justin Schwartzenberger
There is a pretty good article about the charset requirements for search engine bots here:
http://hakre.wordpress.com/2010/07/20/encoding-of-the-robots-txt-file/
Hope that helps!
Petr L
Great idea, great blog!
Web Server India
I really like the tricks which you have recommended. It's really great to be able to control robots.txt dynamically.