Santry Technology Solutions, Content Management, DotNetNuke, SharePoint Consulting
 WWWCoder.com Resource Directory

Making Your App Spider Friendly
11/15/2003 8:56:57 PM

In this article we will cover using an HTTPHandler (strictly speaking, an HTTP module) to intercept a request and turn a directory-style path into query string parameters. Then we'll create a spider-friendly link page so that search indexes can reach dynamically generated content pages.

One of the major problems with spiders such as Google, AltaVista, and the other major indexes is that they have trouble spidering dynamically generated sites like our site, WWWCoder.com.

Spiders can run into problems with dynamically generated sites, and even more so when a large number of query string parameters is passed to the script. Refer to the Google FAQ for more information. One way around this issue is URL rewriting. The idea is that the server accepts a URL such as http://www.somesite.com/parentid/259/site/1795/default.aspx, which indexes like Google can work with easily, and then transforms it on the server into the query string form our application requires, such as http://www.somesite.com/Default.aspx?parentid=259&site=1795. In this tutorial we'll discuss how you can write code that lets spiders crawl your site more easily.

Creating an HTTPHandler

The first thing we're going to do is create an HTTPHandler (strictly speaking an HTTP module, since the class implements IHttpModule and is registered under httpModules). It intercepts each incoming request, parses out the directory path, and changes it into query string parameters that our script can deal with.

To define a handler first open the web.config file and add the following entry:

::
::
<system.web>
<httpModules>
<add name="HTTPHandler" type="MyNamespace.HTTPHandler, MyNamespace" />
</httpModules>

Then create a new class in your web project for the HTTPHandler; in this example the file is called HTTPHandler.vb. The class below has a few methods, and the one we're interested in is OnBeginRequest, which handles the application's BeginRequest event. It fires before any HTML content is served to the client. In this method we check whether the request contains the directory path and, if so, convert it to a query string that our application can work with.

Imports System
Imports System.Configuration
Imports System.Text
Imports System.Web
Namespace MyNamespace
  Public Class HTTPHandler
    Implements IHttpModule
    Public Sub Init(ByVal app As HttpApplication) Implements IHttpModule.Init
      AddHandler app.BeginRequest, AddressOf Me.OnBeginRequest
    End Sub
    Public Sub Dispose() Implements IHttpModule.Dispose
    End Sub
    Public Sub OnBeginRequest(ByVal s As Object, ByVal e As EventArgs)
      Dim objHttpApplication As HttpApplication = CType(s, HttpApplication)
      'check for a directory path being sent instead of the query string.
      'only the path is inspected, not the full URL with scheme and host.
      Dim sPath As String = objHttpApplication.Request.Path.ToLower()
      Dim iStart As Integer = sPath.IndexOf("/parentid/")
      If iStart >= 0 Then
        'keep the part from /parentid/ onward and drop the trailing page name,
        'so "/parentid/259/site/1795/default.aspx" becomes "/parentid/259/site/1795".
        Dim sNewURL As String = sPath.Substring(iStart)
        If sNewURL.EndsWith("/default.aspx") Then
          sNewURL = sNewURL.Substring(0, sNewURL.Length - "/default.aspx".Length)
        End If
        sNewURL = sNewURL.TrimEnd("/"c)
        'now we'll replace the paths with the appropriate query strings.
        sNewURL = Replace(sNewURL, "/parentid/", "?parentid=")
        sNewURL = Replace(sNewURL, "/site/", "&site=")
        'now we'll rewrite the directory path and send our script a query string
        'that it can deal with.
        objHttpApplication.Context.RewritePath("~/Default.aspx" & sNewURL)
      End If
    End Sub
  End Class
End Namespace
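
The module relies on plain string replacement. As a hedged alternative, the same mapping could be checked with a regular expression so that only well-formed paths are rewritten. The sketch below is a standalone console module, not part of the original article: the names UrlMapper and ToQueryUrl are hypothetical, and it assumes both IDs are numeric, as in the example URL.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UrlMapper
    ' Assumption: parentid and site are numeric, as in the article's example URL.
    ' One or more slashes are tolerated before default.aspx.
    Private ReadOnly FriendlyPath As New Regex( _
        "/parentid/(\d+)/site/(\d+)/+default\.aspx$", RegexOptions.IgnoreCase)

    ' Hypothetical helper: turns "/parentid/259/site/1795/default.aspx"
    ' into "/Default.aspx?parentid=259&site=1795".
    ' Returns Nothing when the path does not follow that pattern.
    Public Function ToQueryUrl(ByVal path As String) As String
        Dim m As Match = FriendlyPath.Match(path)
        If Not m.Success Then Return Nothing
        Return "/Default.aspx?parentid=" & m.Groups(1).Value & _
               "&site=" & m.Groups(2).Value
    End Function

    Sub Main()
        ' prints /Default.aspx?parentid=259&site=1795
        Console.WriteLine(ToQueryUrl("/parentid/259/site/1795/default.aspx"))
    End Sub
End Module
```

Inside OnBeginRequest, the result of such a function could be passed to RewritePath whenever it is not Nothing, leaving every other request untouched.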

Generating a Spider Page

By intercepting the request, we can handle a requested directory path and send our script the appropriate query string.

Now that we can convert a directory path to a query string, what next? We need a script that outputs our usual dynamic links, but with directory paths in place of query strings. You may want to create a spider page that links to your articles, so Google can start from that page and reach the dynamic content within your site.

Imports System
Imports System.Collections
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports System.Reflection
Imports System.Text
Imports System.Configuration
Namespace MyNamespace
  Public MustInherit Class DirectoryCrawler
    Inherits System.Web.UI.Page
    Protected WithEvents lblLinks As System.Web.UI.WebControls.Label
    Protected WithEvents pnlModuleContent As System.Web.UI.WebControls.Panel

    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As _
        System.EventArgs) Handles MyBase.Load
      If Not (IsPostBack) Then
        lblLinks.Text = CreateHTML()
      End If
    End Sub

    'This function will call a stored procedure and generate our links
    'using the directory listing format instead of the query string.
    Public Function CreateHTML() As String
      Try
        Dim strHTMLContent As String = ""
        'an application at the site root reports "/" as its path.
        Dim strAppPath As String = Request.ApplicationPath
        If strAppPath = "/" Then strAppPath = ""
        Dim myObject As New myObjectDB
        'now call a stored procedure and return a data reader.
        Dim dr As SqlDataReader = myObject.GetAllArticles()
        Do While dr.Read()
          'build the directory path for later handling by our HTTPHandler.
          strHTMLContent &= "<a href=""" & strAppPath & "/parentid/" & _
              dr("SiteCatID").ToString() & "/site/" & dr("SiteID").ToString() & _
              "/default.aspx"">" & dr("SiteName").ToString() & "</a>"
        Loop
        'release the reader once all rows have been read.
        dr.Close()
        Return strHTMLContent
      Catch ex As Exception
        Return ex.Message
      End Try
    End Function
  End Class
End Namespace
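
The link format used in CreateHTML can also be isolated in a small helper, which makes it easy to check that the generated URLs match what the rewriting module expects. This is a minimal sketch under stated assumptions: the module name FriendlyLinks and the function BuildFriendlyUrl are hypothetical and do not appear in the original code, and the IDs are assumed to be integers.

```vbnet
Imports System

Module FriendlyLinks
    ' Hypothetical helper: builds the directory-style URL that the
    ' rewriting module later turns back into query string parameters.
    Public Function BuildFriendlyUrl(ByVal appPath As String, _
            ByVal parentId As Integer, ByVal siteId As Integer) As String
        ' An application at the site root reports "/" as its path;
        ' treat it as empty so the URL does not start with "//".
        Dim prefix As String = appPath
        If prefix Is Nothing OrElse prefix = "/" Then prefix = ""
        Return prefix.TrimEnd("/"c) & "/parentid/" & parentId.ToString() & _
               "/site/" & siteId.ToString() & "/default.aspx"
    End Function

    Sub Main()
        ' prints /parentid/259/site/1795/default.aspx
        Console.WriteLine(BuildFriendlyUrl("/", 259, 1795))
        ' prints /myapp/parentid/259/site/1795/default.aspx
        Console.WriteLine(BuildFriendlyUrl("/myapp", 259, 1795))
    End Sub
End Module
```

Keeping the format in one place means a change to the URL scheme only has to be made here and in the rewriting module.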

Now in our aspx page, we'll have the following controls where the links will be presented.

<%@ Page Language="vb" AutoEventWireup="false" Codebehind="DirectoryCrawler.aspx.vb" 
Inherits="MyNamespace.DirectoryCrawler" %>
<asp:Panel ID="pnlModuleContent" Runat="server">
<asp:Label id="lblLinks" Runat="server" EnableViewState="False" Enabled="True" 
CssClass="Invisible" ></asp:Label>
</asp:Panel>

You can keep this page on your site to list all of your content, allowing spiders to index your site and get your content out to the rest of the world.

Additional Tips

One more tip for making sure engines can spider your site: avoid postbacks as much as possible. Most spiders cannot issue a POST request to your Web server, yet many applications rely on postbacks to move to the next step. Postbacks are great for capturing events and running a method in response, but if you want a spider to reach all of your content beyond the main display, you will have to handle request parameters rather than postbacks. Another benefit of writing your applications around request parameters is a smaller viewstate, and therefore smaller pages, since every request becomes a new request to the server.
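
As an illustration of request parameters in place of postbacks, paging can be driven by an ordinary link with a page number in the query string. The sketch below is a standalone console module under stated assumptions: the names Paging, ParsePage and NextPageLink, the "page" parameter, and the articles.aspx URL are all hypothetical, and Integer.TryParse requires .NET 2.0 or later.

```vbnet
Imports System

Module Paging
    ' Hypothetical helper: interprets the raw value of a "page" parameter,
    ' i.e. what Request.QueryString("page") would return inside a page.
    ' Missing, non-numeric, or non-positive values fall back to page 1.
    Public Function ParsePage(ByVal raw As String) As Integer
        Dim page As Integer
        If Integer.TryParse(raw, page) AndAlso page >= 1 Then Return page
        Return 1
    End Function

    ' Builds a plain GET link to the following page, which a spider can
    ' follow, unlike a postback-driven "Next" button.
    Public Function NextPageLink(ByVal baseUrl As String, ByVal current As Integer) As String
        Return "<a href=""" & baseUrl & "?page=" & (current + 1).ToString() & """>Next</a>"
    End Function

    Sub Main()
        Dim current As Integer = ParsePage("2")
        ' prints <a href="articles.aspx?page=3">Next</a>
        Console.WriteLine(NextPageLink("articles.aspx", current))
    End Sub
End Module
```

Each page of results then has its own URL, which a spider can request with a normal GET.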

By: Patrick Santry, Microsoft MVP (ASP/ASP.NET), developer of this site, author of books on Web technologies, and member of the DotNetNuke core development team. If you're interested in the services provided by Patrick, visit his company Website at Santry.com.


Comments Left:
Left on 5/28/2005 7:33:52 AM by Anonymous
Comments: this is what I needed.
though I do want to do everything in C#.
so I translated this in C# and I am leaving my HttpModule code here, hoping it may be useful.

using System;
using System.Text.RegularExpressions;
using System.Web;

public class ProductDetailsHandler : System.Web.IHttpModule
{
    public void Init(HttpApplication application)
    {
        application.BeginRequest += new EventHandler(application_BeginRequest);
    }

    public void Dispose() {}

    private void application_BeginRequest(object sender, EventArgs e)
    {
        HttpApplication application = (HttpApplication) sender;

        // check for a directory path being sent instead of the query string.
        string url = application.Request.Url.ToString();

        if ((url.ToLower().IndexOf("productdetails/") != -1) && (url.ToLower().IndexOf("addtocart") == -1))
        {
            // the named group ProductId captures the product id from the file name.
            Regex r = new Regex(@"ProductDetails/(?<ProductId>\w{1,16})\.aspx$", RegexOptions.IgnoreCase);
            Match m = r.Match(url);

            if (m.Success)
            {
                string productId = m.Groups["ProductId"].Value;
                application.Context.RewritePath("/ProductDetails.aspx?ProductId=" + productId);
            }
            else
            {
                application.Context.RewritePath("/ProductDetails.aspx");
            }
        }
    }
}

Left on 2/15/2005 6:24:09 AM by Anonymous
Comments: Howdy, I liked your article so much I've created a comment and link to it on my weblog.
Nice work!
My comment about this article
Left on 4/5/2004 9:38:36 PM by Anonymous
Comments: Comments from the following blog: Patrick Santry's Blog, located at: http://blogs.wwwcoder.com/psantry/archive/2004/04/05/357.aspx
Left on 12/22/2003 10:23:03 AM by Anonymous
Comments: Thanks to the Author for finding time to share his tips with the web community (that does not completely realise the significance of it)... For those thinking that this approach is only suitable for POST-less webforms: you're wrong :) You may find somewhere on the web a simple technique for overriding the "Render" procedure in order to replace the form's "action" attribute with whatever is needed... see how these techniques are implemented on some of my sites (agriterra.com, orthodoxeurope.org, voyageservice.net)
Left on 12/1/2003 9:51:55 AM by Anonymous
Comments: For the comment on whether or not the author did any research, the question should be: have you read the Google FAQ? No one has said that Google cannot spider pages generated dynamically; however, Google does refrain from indexing dynamic pages. So by using the above method you make the spider "think" that your pages are not being generated dynamically, by creating a unique directory structure for each request instead of using a query string. Refer to the following FAQ quote from Google: "We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index."
Left on 12/1/2003 9:37:18 AM by Anonymous
Comments: My comment from before must have got erased because they didn't like it because it countered their whole article, but if you look at the following link (http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=x+site%3Arainbowportal.net) you will see that all are of the same page but with different query strings.  Same is true of Altavista.  I am just wondering if the author did any research at all?
Left on 11/29/2003 11:00:40 AM by Anonymous
Comments: Although this code can run on complex postback pages after some modifications, I still prefer the 404 method. I have used it in several places and it works without any problems at all
Left on 11/27/2003 10:12:55 AM by Anonymous
Comments: For the poster below, Did you even bother to read the Google FAQ referenced in the article?
Left on 11/27/2003 10:03:48 AM by Anonymous
Comments: You can change the 404 page to be an ASPX page without affecting whether things like images get processed using the ASP.NET processor by changing the 404 handler in IIS. You then make your website links point to folders which do not really exist (e.g. /articles/1282/), and as there is no default.aspx in this folder, the specified 404 handler aspx page will be invoked by IIS. On this page, you can then change the response code to 200 OK and Server.Transfer to the correct page. I have implemented this on many of my websites and it works perfectly. The people below have obviously not even tried my method or do not understand my description of it. Unlike the code above, normal requests to aspx pages which exist are not affected, and an HTTP Handler does not have to process every single request as shown above.

This method is much simpler than the above method, but may not necessarily be any better.

Don't forget that contrary to what is implied in this article Google CAN crawl dynamic aspx pages, so all you need to do is ensure you link to them - you do not need to bother with URL rewriting to get your site indexed by Google.
nick(at)x-rm.com
Left on 11/27/2003 7:07:19 AM by Anonymous
Comments: It is important to note, especially for novices who are trying to use url rewriting, that the original url must include a reference to a .aspx page. If the url doesn't contain a .aspx page the web server doesn't know you are wanting to use the ASP.NET processor and so returns a 404 error (if the directory doesn't exist, which it probably won't - that's the whole point of url rewriting). This can be solved by reconfiguring the web server, so that every single request passes through the ASP.NET processor, however this would mean that images, javascript, css and other files that can be downloaded from your site would go through the processor (which is something I would not recommend).
Left on 11/26/2003 4:45:04 PM by Anonymous
Comments: I think the most important part of this article is the Postback problem. Postbacks will actually ruin this code. When a Postback occurs, the Form url gets rewritten with the wrong "spider friendly" url. This code will only work on Postbackless sites.
Left on 11/26/2003 5:43:03 AM by Anonymous
Comments: StringBuilder is for concatenating large strings; in this case, where the string is very small, the code is entirely acceptable. And the 404 handler would never work in this example: the problem here isn't that a spider is not finding the document and the server is returning a 404, the problem is that the spider will not index the document because it is dynamic.
Left on 11/26/2003 5:02:51 AM by Anonymous
Comments: The basic idea of the article is sound, but this is a very complicated and inefficient way of solving a very simple problem. It is much easier to do the same thing by implementing a custom 404 handler for your virtual subfolder of articles which can then change the response code to "200 OK" and Server.Transfer to the correct document. This can be done with about 5 lines of code and doesn't affect normal page requests.

Additionally, you should NEVER build large strings as shown above, you should use the StringBuilder class which is much much faster for building large strings.
Left on 11/26/2003 3:51:27 AM by Anonymous
Comments: Exactly what I was looking for.
Thx !!
  

 Copyright 2008 - Santry Technology Solutions, Box 172, Girard, PA 16417, (814) 774-0970