Indexing Database Content with dotLucene
5/4/2005 2:02:34 PM
In this article we will discuss how you can use dotLucene to index content located within a database. dotLucene is a open source full-text catalog solution developed in C# which is a direct line by line port of the popular Lucene engine based on Java.
In this article we will discuss how you can use
dotLucene to index content located
within a database. dotLucene is a open
source full-text catalog solution developed in C# which is a direct line by line
port of the popular Lucene engine based on Java.
In this particular example, we will discuss how the Weblog indexer was created
for BlingBlangBlog.com and
WWWCoder.com that goes out and spiders Weblog content, places it into a
warehouse, and then code is written to index the content via
dotLucene for later searching.
In order to index the Weblog
content, a Windows application was created to spider the RSS content and place
it into a database table. The only application that actually makes any requests
to this database table is the spider and indexer applications. Our Web
application will connect to the dotLucene
catalog directly for searching.
We're not going to cover the
actual spider in this article, basically the spider will make requests to the
RSS feeds located in the database, query each item, check the database to see if
the items are in there, if not then it will insert the individual item in the
database table. Once we have our table populated, we'll then create a query that
returns a recordset and then populates our index.
Indexing Content
The first method we'll cover
is the AddDocumentsToIndex method. This method can be instantiated from a button
click. The method will make a request to our database to obtain the records and
their field values that we want to place in our index.
Public
Sub AddDocumentsToIndex()
Dim CatalogDir As String = "c:\pathtoindex\"
Dim DSNASPSearch As
DataSet
Dim SQLQ As
String = "SELECT * FROM MyTable" 'could be a
join, sproc, whatever.
Dim sqlConnection As
SqlConnection = New SqlConnection(DB_Connection)
sqlConnection.Open()
Dim SqlCommand As
SqlCommand = New SqlCommand(SQLQ, sqlConnection)
Dim dr As
SqlDataReader = SqlCommand.ExecuteReader()
'the CateLogDir is a
string value of the physical directory
Dim Indexer As
New Indexer(CatelogDir)
While dr.Read
Indexer.AddDocument(dr("pubDate"), _
dr("Description"), _
dr("URL"),
_
dr("Title"))
End While
dr.Close()
dr =
Nothing
sqlConnection.Close()
sqlConnection =
Nothing
End Sub
We first set a value for the physical directory of the
index location. Again for demonstration we hard coded the values here, you could
set up some configuration file for obtaining the values. You can see in the code
above that we're making a call to the Indexer.AddDocument method, this method is
contained in the Indexer class which focuses on interaction with the
dotLucene API.
You can see in the indexer class, it is pretty thin, you
basically call methods of dotLucene to
create an instance of the catalog, and then populate a document with the values
from your record. Another item of note is the method ReturnSortDate, this
formats the date in the database to 20050501 so we can sort by date within our
search.
Imports
Lucene.Net.Documents
Imports
Lucene.Net.Index
Imports
Lucene.Net.Analysis.Standard
Imports
System.Text.RegularExpressions
Imports
System
Public
Class Indexer
Private
writer As IndexWriter
Public Sub
New(ByVal
Directory As String)
InitializeIndex(Directory)
End Sub
Public Sub
InitializeIndex(ByVal Directory
As String)
writer
= New IndexWriter(Directory,
New StandardAnalyzer,
True)
writer.SetUseCompoundFile(True)
End Sub
Public Sub
AddDocument(ByVal DateAdded
As String, _
ByVal Description As
String, ByVal
URL As String, _
ByVal Title As String)
Try
Dim doc As
New Document
doc.Add(Field.Keyword("date", DateAdded))
doc.Add(Field.Text("description", Description))
doc.Add(Field.Text("url", URL))
doc.Add(Field.Text("title", Title))
doc.Add(Field.Keyword("sortdate",
ReturnSortDate(DateAdded)))
writer.AddDocument(doc)
Catch ex As
Exception
'failed to add doc
End Try
End Sub
Private Function
ReturnSortDate(ByVal DateAdded
As Date)
As Integer
'Returns 20050511 for sorting on date.
Dim thisMonth As
String = DateAdded.Month.ToString
'add leading zero
If thisMonth.Length = 1
Then thisMonth = "0" & thisMonth
Dim thisYear As
String = DateAdded.Year.ToString
Dim thisDay As
String = DateAdded.Day.ToString
If thisDay.Length = 1
Then thisDay = "0" & thisDay
Return thisYear & thisMonth & thisDay
End Function
Private Function
Close()
writer.Optimize()
writer.Close()
End Function
End
Class
Searching Content
That's it, we have a searchable index ready for our search
code in a Web form. dotLucene comes with
search examples in the source distribution, so here we'll cover some basic
searching of this index. The sample code includes paging routines, and binding
of a DataTable to a Web form. In the next block of code we'll do a search on the
catalog and return a DataTable with the results. The Search method will accept
the search query passed from a textbox.
Imports
Lucene.Net.Documents
Imports
Lucene.Net.Index
Imports
Lucene.Net.Analysis.Standard
Imports
System.Text.RegularExpressions
Imports
System
Public
Class Searcher
Dim
searcher As IndexSearcher
Sub New(ByVal
Directory As String)
searcher = New IndexSearcher(Directory)
End Sub
Public
Function Search(ByVal
Query As String,
_
Optional ByVal
SortBy As String
= "date") As DataTable
‘we’ll
create our datatable structure to mirror the index.
Dim Results As
New DataTable
Results.Columns.Add("Title")
Results.Columns.Add("Description")
Results.Columns.Add("URL")
Results.Columns.Add("Published")
Dim
MyQuery As Query = QueryParser.Parse(Query,
"description", _
New StandardAnalyzer)
Dim Sort As Sort
= New Sort(SortBy, True)
Dim Hits As Hits
= searcher.Search(MyQuery, Sort)
mTotalRecs = Hits.Length
Dim
iCount As Integer = 0
While
iCount < mTotalRecs
Dim doc As
Document = Hits.Doc(iCount)
Dim row As
DataRow = Results.NewRow
row("url") = doc.Get("url")
row("Title") = doc.Get("title")
row("Description") = doc.Get("description")
row("Published") = doc.Get("date")
Results.Rows.Add(row)
iCount = iCount + 1
Loop
searcher.Close()
Return Results
End
Function
End
Class
Now that you have a DataTable created with the results from
your dotLucene search, you can bind it
to a DataGrid for viewing on the Web or some other application. For example:
Dim
searcher As New
Searcher("c:\pathtoindex\")
Dim
ds As New
DataTable
Dim
myQuery As String
= myKeywords
ds =
searcher.Search(myQuery, sort)
Where you pass the keywords or query to the Search method
and the sort type. In our application the sort method parameter sent to
dotLucene would be "date" for the date
field we calculated earlier using the ReturnSortDate method used to format our
date to a sortable integer.
|