Posted on March 29th, 2011
If you are building a public-facing web site there is a good chance that you will have a "dev" and a "production" deployment setup so your client can view and approve work at the "dev" level before it is released to the "production" level. You may need your "dev" site to be publicly accessible so the client can show it off to people in their circle without requiring you to set up allowed access (not ideal, but very "real world"). This setup doesn't really pose a problem for our content since it will be the same at both locations...but it does pose a problem for search engine optimization. We need to make sure that search engines don't discover and index (with their crawler bots) both site locations, since they would contain duplicate content and since the "dev" location is not meant for external consumption. If we use Google as an example, we will need a different robots.txt file for each target deployment. But how can we accomplish that within an MVC3 web project?
Let's identify our goal:
Allow both "dev" and "production" to support the same route to our robots.txt file (http://ourdomain.com/robots.txt) but have the contents of the file change based on target deployment without having to do any extra steps when deploying.
We will accomplish this by dynamically rendering out the robots.txt file instead of including that file in our project. The content of this file will be different based on our target deployment. We can leverage config transforms to handle a flag that identifies which content to use.
Let's start by creating an appSetting in our web.config files. If you create a new MVC3 web project, the web.config file already has two config transform files created, Web.Debug.config and Web.Release.config, based on your default Visual Studio 2010 build configurations.
If we open up the Web.config file we can see the appSettings node looks like so:
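A rough sketch of what a new MVC3 project's appSettings node typically contains (the exact keys can vary by project template):

<appSettings>
  <add key="ClientValidationEnabled" value="true" />
  <add key="UnobtrusiveJavaScriptEnabled" value="true" />
</appSettings>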
We will add a new setting called RobotsTextAccess and will use a value of either "private" or "public" to denote what type of content to use in our robots.txt file. Let's set this to "private" at this level.
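With that in place, the appSettings node would look something like this (again a sketch; the other keys are whatever your project already had):

<appSettings>
  <add key="ClientValidationEnabled" value="true" />
  <add key="UnobtrusiveJavaScriptEnabled" value="true" />
  <add key="RobotsTextAccess" value="private" />
</appSettings>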
Since we set the default value here at the Web.config file, our transforms will automatically use this unless we override them at the target build config level. With the value set to "private", we could technically leave our Web.Debug.config alone as it will use that same value (and we want the debug target to be private). However, it is a good idea to add the transform to both targets in case we want to change it at the top level in the future.
We need to add an appSettings node (inside of the configuration node) to our Web.Debug.config and Web.Release.config files as well as a child add node to do the transform.
The Web.Debug.config code will look like:
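A sketch of that transform, using the standard xdt SetAttributes/Match(key) syntax (the system.web transform section that Visual Studio generates is omitted here):

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <appSettings>
    <add key="RobotsTextAccess" value="private"
         xdt:Transform="SetAttributes" xdt:Locator="Match(key)" />
  </appSettings>
</configuration>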
The Web.Release.config will look like:
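The same transform, but with the "public" value (again a sketch with the generated sections omitted):

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <appSettings>
    <add key="RobotsTextAccess" value="public"
         xdt:Transform="SetAttributes" xdt:Locator="Match(key)" />
  </appSettings>
</configuration>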
With our settings flag in place we can start crafting the code to render the robots.txt content. We need to create a controller called CrawlerController for handling the logic. We will include an action method called RobotsTextFile that will render our robots.txt file. This method will check our web.config app setting flag and, for the time being, will write the robots.txt content directly to the Response stream.
UPDATE
I changed the allow logic to correctly use an empty Disallow statement, as pointed out by rtpHarry in the comments below.
using System.Configuration;
using System.Web.Mvc;

namespace Website.Controllers
{
    public class CrawlerController : Controller
    {
        public ActionResult RobotsTextFile()
        {
            string content;

            // Check our config flag to decide which robots.txt content to serve
            if (ConfigurationManager.AppSettings["RobotsTextAccess"].ToLower() == "private")
            {
                // Block all crawlers from the entire site
                content = "User-agent: *\r\nDisallow: /";
            }
            else
            {
                // An empty Disallow statement allows all crawlers complete access
                content = "User-agent: *\r\nDisallow:";
            }

            // Write the content directly to the Response stream as plain text
            Response.ContentType = "text/plain";
            Response.Write(content);
            return null;
        }
    }
}
Our logic starts with creating a local variable for storing our file content string, followed by a check of the application setting flag to see if we want to render the "private" content; if not, we assume we want to render the "public" content. We then set the content of the file in a single line. After our flag check we can write the content directly to our Response stream and be done.
Next we need to add a route to our route table to support navigation to the http://ourdomain.com/robots.txt destination. In our Global.asax file we need to map a new route in the RegisterRoutes method:
public static void RegisterRoutes(RouteCollection routes)
{
    routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

    routes.MapRoute(
        "Robots.txt",
        "Robots.txt",
        new { controller = "Crawler", action = "RobotsTextFile" }
    );

    routes.MapRoute(
        "Default", // Route name
        "{controller}/{action}/{id}", // URL with parameters
        new { controller = "Home", action = "Index", id = UrlParameter.Optional } // Parameter defaults
    );
}
With the route in place we can now F5 our application and navigate directly to our robots.txt route and see our results:
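With the flag set to "private" (our default in Web.config), the response should contain the disallow-everything rules:

User-agent: *
Disallow: /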
Let's see if our "public" flag is working as well. Open up the Web.config file and change the flag value to "public", F5 again, and navigate to the robots.txt file again:
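Now the response should contain the allow-everything version with the empty Disallow statement:

User-agent: *
Disallow: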
Disco! We are in business now. But wait, our project build configuration is currently set to "Debug". How come it didn't use the Web.Debug.config setting? The web config transforms are used upon deployment or packaging of a project. When you run the site from the Visual Studio debugger it does not apply any transforms and uses the top level Web.config file. This is a prime example of why we want to include the transform setting in both of our targets. It allows us to play with our settings in our local development environment and not have to remember to set them back before we deploy.
That's it. Done. We have established a way to handle target-deployment-specific crawler access file content. From here we could refactor the way we populate the Response content. Maybe we want to create two project text files for our content to make it easier to edit and visualize (instead of doing it in code). If so, we could update our controller action logic to open the correct file based on the flag and then stream the contents directly to the Response stream, as sketched below.
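As an illustration, here is a rough sketch of what that file-based version of the action could look like; it would replace the RobotsTextFile action in the CrawlerController above, and the robots-private.txt / robots-public.txt file names under ~/Content are made up for this example:

public ActionResult RobotsTextFile()
{
    // Pick the content file based on our config flag
    // (hypothetical file names under a hypothetical ~/Content folder)
    string fileName = ConfigurationManager.AppSettings["RobotsTextAccess"].ToLower() == "private"
        ? "robots-private.txt"
        : "robots-public.txt";

    string path = Server.MapPath("~/Content/" + fileName);

    // FilePathResult streams the file contents out with the given content type
    return File(path, "text/plain");
}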
Most likely we will want to add other crawler file support (for oh, I don't know, maybe Bing!). We could add another controller action for that search engine and add another route to our route table for their desired file name. I leave it up to you to take it from here. Code on!
For a detailed description of config transforms you can check out the MSDN article at http://msdn.microsoft.com/en-us/library/dd465326.aspx. If anyone knows of some good links to posts on config transforms in general please feel free to add them to the comments.
Discussion
andrexx
Great article! To fix the IE9 navigation, right click on the toolbar and check 'Show tabs on separate row' :)
Justin Schwartzenberger
peter
You can just drag on the right side of the navigation bar to enlarge it....
Rashmi
Quite helpful. Would be of great help on my UAT and dev servers.
rtpHarry
Hey,
Just wanted to point out that the robots standard doesn't have an "allow" in the spec.
If I recall correctly Google does understand the allow keyword but the correct way to do it is to have an empty Disallow: statement.
You can read this at http://www.robotstxt.org/robotstxt.html; just scroll down to the heading "To allow all robots complete access".
Ramandeep Singh
Hi There,
This is a nice implementation.
But I have a query regarding the same, as I have tried it but not yet implemented it on the production environment. When I hit an actual robots.txt file and look at the Content-Type in the response headers using Firebug, it shows Content-Type = text/plain.
But when I serve the content dynamically through the example you provided, the response header shows Content-Type = text/plain; charset=utf-8.
My query is: will crawlers be able to crawl through it if the Content-Type = text/plain; charset=utf-8, as crawlers are rigid in this context?
Thanks, Ramandeep Singh
Justin Schwartzenberger
There is a pretty good article about the charset requirements for search engine bots here:
http://hakre.wordpress.com/2010/07/20/encoding-of-the-robots-txt-file/
Hope that helps!
Petr L
Great idea, great blog!
Web Server India
I really like the tricks which you have recommended. It's really great to be able to control robots.txt dynamically.