One of the major problems with search engine spiders, such as those run by Google,
AltaVista and other major indexes, is that they have trouble crawling dynamically
generated sites like our site, WWWCoder.com.
The problem gets worse as the number of query string parameters passed to the script grows. Refer to the Google FAQ for more information. One way around this issue is to
use URL rewriting. The idea is that the server accepts a URL like
http://www.somesite.com/parentid/259/site/1795/default.aspx, which
indexes like Google can work with easily, and then transforms it on the server
into the query strings that our application requires, like:
http://www.somesite.com/Default.aspx?parentid=259&site=1795.
In this tutorial we'll discuss how you can write code that lets spiders crawl
your site more easily.
Creating an HTTPHandler
The first thing we're going to do is create an HTTPHandler. Despite the name, the
class implements IHttpModule, so it is registered as an HTTP module. It will
intercept each incoming request in order to parse out the directory paths and change
them into query string parameters that our script can deal with.
To define the handler, first open the web.config file and add the following
entry:
<system.web>
    <httpModules>
        <add name="HTTPHandler" type="MyNamespace.HTTPHandler, MyNamespace" />
    </httpModules>
</system.web>
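The entry above uses the classic httpModules section. As a hedged aside, on IIS 7 or later running in integrated pipeline mode, modules are registered under system.webServer instead. A sketch of the equivalent entry, assuming the class MyNamespace.HTTPHandler shown in this article and an assembly named MyNamespace, might look like this:

```xml
<system.webServer>
    <modules>
        <add name="HTTPHandler" type="MyNamespace.HTTPHandler, MyNamespace" />
    </modules>
</system.webServer>
```

Which section applies depends on how your server is configured, so treat this as a starting point rather than a required change.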
Then create a new class in your web project for the HTTPHandler; in this example the file
will be called HTTPHandler.vb. The following code has a couple of methods;
the one we're interested in is OnBeginRequest, which handles the BeginRequest event. It fires before
any HTML content is served to the client. This method is where we'll
check whether the request contains a directory path and, if so, convert it to a
query string that our application can work with.
Imports System
Imports System.Configuration
Imports System.Text
Imports System.Web
Namespace MyNamespace
    Public Class HTTPHandler
        Implements IHttpModule
        'hook our method up to the BeginRequest event of the application.
        Public Sub Init(ByVal app As HttpApplication) Implements IHttpModule.Init
            AddHandler app.BeginRequest, AddressOf Me.OnBeginRequest
        End Sub
        Public Sub Dispose() Implements IHttpModule.Dispose
        End Sub
        Public Delegate Sub MyEventHandler(ByVal s As Object, ByVal e As EventArgs)
        Public Event MyEvent As MyEventHandler
        Public Sub OnBeginRequest(ByVal s As Object, ByVal e As EventArgs)
            Dim objHttpApplication As HttpApplication = CType(s, HttpApplication)
            'check for a directory path being sent instead of the query string.
            'we only look at the path, not the full URL with scheme and host.
            Dim sPath As String = objHttpApplication.Request.Path.ToLower()
            Dim iStart As Integer = sPath.IndexOf("/parentid/")
            If iStart >= 0 Then
                'keep only the part from /parentid/ onward, without the page name.
                Dim sNewURL As String = sPath.Substring(iStart)
                If sNewURL.EndsWith("/default.aspx") Then
                    sNewURL = sNewURL.Substring(0, sNewURL.Length - "/default.aspx".Length)
                End If
                sNewURL = sNewURL.TrimEnd("/"c)
                'now we'll replace the paths with the appropriate query strings.
                sNewURL = Replace(sNewURL, "/parentid/", "?parentid=")
                sNewURL = Replace(sNewURL, "/site/", "&site=")
                'now we'll rewrite the directory path and send our script a query string
                'that it can deal with.
                objHttpApplication.Context.RewritePath("~/Default.aspx" & sNewURL)
            End If
        End Sub
    End Class
End Namespace
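The string handling in OnBeginRequest can be tried on its own, outside of ASP.NET. The following is a minimal sketch and not part of the original handler: the module name RewriteDemo and the function ToQueryString are hypothetical, and it assumes the friendly URL always has the form /parentid/<id>/site/<id>/default.aspx.

```vb
Imports System

Module RewriteDemo
    'Hypothetical helper: turns a friendly path into the query-string URL
    'that Default.aspx expects. Returns Nothing when the path does not
    'contain a /parentid/ segment.
    Function ToQueryString(ByVal path As String) As String
        Dim lowered As String = path.ToLower()
        Dim start As Integer = lowered.IndexOf("/parentid/")
        If start < 0 Then Return Nothing

        'keep only the segments from /parentid/ onward.
        Dim tail As String = lowered.Substring(start)
        'drop the trailing page name that makes ASP.NET handle the request.
        If tail.EndsWith("/default.aspx") Then
            tail = tail.Substring(0, tail.Length - "/default.aspx".Length)
        End If
        tail = tail.TrimEnd("/"c)

        'swap the directory separators for query-string separators.
        tail = tail.Replace("/parentid/", "?parentid=").Replace("/site/", "&site=")
        Return "~/Default.aspx" & tail
    End Function

    Sub Main()
        Console.WriteLine(ToQueryString("/parentid/259/site/1795/default.aspx"))
        Console.WriteLine(ToQueryString("/about/default.aspx") Is Nothing)
    End Sub
End Module
```

Run as a console program, this prints ~/Default.aspx?parentid=259&site=1795 followed by True, which is a quick way to check the rewriting rules before wiring them into the module.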
Generating a Spider Page
By intercepting the request we can handle a directory path that is
requested and send our script the appropriate query string.
Now that we can convert a directory path to a query string, what next? We need
to write a script that outputs our typical dynamic links, but with directory
paths in the links instead of query strings. You may
want to create a spider page that links to your articles, so Google
can go to that page and from there reach the dynamic content within your
site.
Imports System
Imports System.Collections
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports System.Reflection
Imports System.Text
Imports System.Configuration
Namespace MyNamespace
    Public MustInherit Class DirectoryCrawler
        Inherits System.Web.UI.Page
        Protected WithEvents lblLinks As System.Web.UI.WebControls.Label
        Protected WithEvents pnlModuleContent As System.Web.UI.WebControls.Panel
        Private Sub Page_Load(ByVal sender As System.Object, ByVal e As _
            System.EventArgs) Handles MyBase.Load
            Dim strContent As String
            If Not (IsPostBack) Then
                strContent = CreateHTML()
                lblLinks.Text = strContent
            End If
        End Sub
        'This function will call a stored procedure and generate our links
        'using the directory listing format instead of the query string.
        Public Function CreateHTML() As String
            Dim dr As SqlDataReader = Nothing
            Try
                Dim sbLinks As New StringBuilder
                'a root application reports "/", which would give a double slash.
                Dim strAppPath As String = Request.ApplicationPath
                If strAppPath = "/" Then strAppPath = ""
                'myObjectDB stands in for your own data access class.
                Dim myObject As New myObjectDB
                'now call a stored procedure and return a data reader.
                dr = myObject.GetAllArticles()
                Do While dr.Read()
                    'build the directory path for later handling by our HTTPHandler.
                    sbLinks.Append("<a href=""" & strAppPath & "/parentid/" & _
                        dr("SiteCatID").ToString() & "/site/" & _
                        dr("SiteID").ToString() & "/default.aspx"">" & _
                        Server.HtmlEncode(dr("SiteName").ToString()) & "</a>")
                Loop
                Return sbLinks.ToString()
            Catch ex As Exception
                Return ex.Message
            Finally
                'always release the reader, even when an error occurs.
                If Not dr Is Nothing Then dr.Close()
            End Try
        End Function
    End Class
End Namespace
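To see exactly what markup the link building produces, the same concatenation can be run without a database or a page. This is only a sketch under assumptions: the module LinkDemo and the function BuildSpiderLink are hypothetical names, the IDs are taken to be integers, and the site name is assumed to be plain text that needs no HTML encoding.

```vb
Imports System

Module LinkDemo
    'Hypothetical helper: builds one anchor tag in the directory format.
    Function BuildSpiderLink(ByVal appPath As String, ByVal catId As Integer, _
                             ByVal siteId As Integer, ByVal siteName As String) As String
        'a root application reports "/", which would produce a double slash.
        If appPath = "/" Then appPath = ""
        Return "<a href=""" & appPath & "/parentid/" & catId.ToString() & _
               "/site/" & siteId.ToString() & "/default.aspx"">" & _
               siteName & "</a>"
    End Function

    Sub Main()
        Console.WriteLine(BuildSpiderLink("/", 259, 1795, "WWWCoder"))
        Console.WriteLine(BuildSpiderLink("/portal", 259, 1795, "WWWCoder"))
    End Sub
End Module
```

The first call prints <a href="/parentid/259/site/1795/default.aspx">WWWCoder</a>, and the second prints the same link prefixed with /portal, which is the shape of URL the HTTPHandler expects to receive.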
Now in our aspx page, we'll have the following objects where the links will be presented.
<%@ Page Language="vb" AutoEventWireup="false" Codebehind="DirectoryCrawler.aspx.vb"
    Inherits="MyNamespace.DirectoryCrawler" %>
<asp:Panel ID="pnlModuleContent" Runat="server">
    <asp:Label id="lblLinks" Runat="server" EnableViewState="False" Enabled="True"
        CssClass="Invisible" ></asp:Label>
</asp:Panel>
You can place this page on your site to list all of your content, allowing
spiders to index your site and get your content out to the rest of the world.
Additional Tips
One additional tip for making sure engines can spider your site: avoid
postbacks as much as possible. Most spiders cannot make a POST request to
your Web server, and many applications use postbacks to move to the
next step of a process. Postbacks are great for capturing
events and running some method, but if you want a spider to reach all of
your content in addition to the main display, you will have to
handle request parameters rather than postbacks in your application.
Another benefit of using request parameters is that it
reduces the size of your viewstate, making your pages smaller,
since every request becomes a new request to the server.
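As an illustration of this tip, paging is a common place where a postback can be swapped for a plain link. The sketch below is hypothetical and not from the original application: the names PagingDemo, ParsePage and NextPageLink, the page parameter and the default.aspx target are all assumptions, and Integer.TryParse requires .NET 2.0 or later.

```vb
Imports System

Module PagingDemo
    'Hypothetical helper: reads a page number from a raw query-string value,
    'falling back to page 1 when it is missing or not a positive number.
    Function ParsePage(ByVal rawValue As String) As Integer
        Dim page As Integer
        If Integer.TryParse(rawValue, page) AndAlso page > 0 Then Return page
        Return 1
    End Function

    'Hypothetical helper: builds a plain link to the following page, which
    'a spider can follow with an ordinary GET request.
    Function NextPageLink(ByVal currentPage As Integer) As String
        Return "<a href=""default.aspx?page=" & (currentPage + 1).ToString() & """>Next</a>"
    End Function

    Sub Main()
        Console.WriteLine(ParsePage("2"))
        Console.WriteLine(ParsePage(Nothing))
        Console.WriteLine(NextPageLink(2))
    End Sub
End Module
```

Inside a page you would pass Request.QueryString("page") to ParsePage in Page_Load and write the link into a Label or Literal, instead of handling the click event of a LinkButton.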
By: Patrick Santry, Microsoft MVP (ASP/ASP.NET), developer of this site, author of books on Web technologies, and member of the DotNetNuke core development team. If you're interested in the services provided by Patrick, visit his company Website at Santry.com.