facebook  linkedin  Twitter  skype  Rss googlePlus

Indexing Database Content with dotLucene

May 04 2005

In this article we will discuss how you can use dotLucene to index content located within a database. dotLucene is a open source full-text catalog solution developed in C# which is a direct line by line port of the popular Lucene engine based on Java.

In this particular example, we will discuss how the Weblog indexer was created for BlingBlangBlog.com and WWWCoder.com that goes out and spiders Weblog content, places it into a warehouse, and then code is written to index the content via dotLucene for later searching.

In order to index the Weblog content, a Windows application was created to spider the RSS content and place it into a database table. The only application that actually makes any requests to this database table is the spider and indexer applications. Our Web application will connect to the dotLucene catalog directly for searching.


We're not going to cover the actual spider in this article, basically the spider will make requests to the RSS feeds located in the database, query each item, check the database to see if the items are in there, if not then it will insert the individual item in the database table. Once we have our table populated, we'll then create a query that returns a recordset and then populates our index.


Indexing Content


The first method we'll cover is the AddDocumentsToIndex method. This method can be instantiated from a button click. The method will make a request to our database to obtain the records and their field values that we want to place in our index.

Public Sub AddDocumentsToIndex()
        Dim CatalogDir As String = "c:\pathtoindex\"
        Dim DSNASPSearch As DataSet
        Dim SQLQ As String = "SELECT * FROM MyTable" 'could be a join, sproc, whatever.
        Dim sqlConnection As SqlConnection = New SqlConnection(DB_Connection)
        Dim SqlCommand As SqlCommand = New SqlCommand(SQLQ, sqlConnection)
        Dim dr As SqlDataReader = SqlCommand.ExecuteReader()
        'the CateLogDir is a string value of the physical directory
        Dim Indexer As New Indexer(CatelogDir)
        While dr.Read
            Indexer.AddDocument(dr("pubDate"), _
                dr("Description"), _
                dr("URL"), _
        End While
        dr = Nothing
        sqlConnection = Nothing
End Sub

We first set a value for the physical directory of the index location. Again for demonstration we hard coded the values here, you could set up some configuration file for obtaining the values. You can see in the code above that we're making a call to the Indexer.AddDocument method, this method is contained in the Indexer class which focuses on interaction with the dotLucene API.

You can see in the indexer class, it is pretty thin, you basically call methods of dotLucene to create an instance of the catalog, and then populate a document with the values from your record. Another item of note is the method ReturnSortDate, this formats the date in the database to 20050501 so we can sort by date within our search.

Imports Lucene.Net.Documents
Imports Lucene.Net.Index
Imports Lucene.Net.Analysis.Standard
Imports System.Text.RegularExpressions
Imports System
Public Class Indexer
    Private writer As IndexWriter
    Public Sub New(ByVal Directory As String)
    End Sub
    Public Sub InitializeIndex(ByVal Directory As String)
        writer = New IndexWriter(Directory, New StandardAnalyzer, True)
    End Sub
    Public Sub AddDocument(ByVal DateAdded As String, _
    ByVal Description As String, ByVal URL As String, _
    ByVal Title As String)
            Dim doc As New Document
            doc.Add(Field.Keyword("date", DateAdded))
            doc.Add(Field.Text("description", Description))
            doc.Add(Field.Text("url", URL))           
            doc.Add(Field.Text("title", Title))
            doc.Add(Field.Keyword("sortdate", ReturnSortDate(DateAdded)))
        Catch ex As Exception
            'failed to add doc
        End Try
    End Sub
    Private Function ReturnSortDate(ByVal DateAdded As Date) As Integer
        'Returns 20050511 for sorting on date.
        Dim thisMonth As String = DateAdded.Month.ToString
        'add leading zero
        If thisMonth.Length = 1 Then thisMonth = "0" & thisMonth
        Dim thisYear As String = DateAdded.Year.ToString
        Dim thisDay As String = DateAdded.Day.ToString
        If thisDay.Length = 1 Then thisDay = "0" & thisDay
        Return thisYear & thisMonth & thisDay
    End Function
    Private Function Close()
    End Function
End Class


Searching Content


That's it, we have a searchable index ready for our search code in a Web form. dotLucene comes with search examples in the source distribution, so here we'll cover some basic searching of this index. The sample code includes paging routines, and binding of a DataTable to a Web form. In the next block of code we'll do a search on the catalog and return a DataTable with the results. The Search method will accept the search query passed from a textbox.

Imports Lucene.Net.Documents
Imports Lucene.Net.Index
Imports Lucene.Net.Analysis.Standard
Imports System.Text.RegularExpressions
Imports System
Public Class Searcher
    Dim searcher As IndexSearcher
    Sub New(ByVal Directory As String)
        searcher = New IndexSearcher(Directory)
    End Sub
      Public Function Search(ByVal Query As String, _
     Optional ByVal SortBy As String = "date") As DataTable
       ‘we’ll create our datatable structure to mirror the index.
             Dim Results As New DataTable
        Dim MyQuery As Query = QueryParser.Parse(Query, "description", _
          New StandardAnalyzer)
        Dim Sort As Sort = New Sort(SortBy, True)
        Dim Hits As Hits = searcher.Search(MyQuery, Sort)
        mTotalRecs = Hits.Length
        Dim iCount As Integer = 0
        While iCount < mTotalRecs
              Dim doc As Document = Hits.Doc(iCount)
               Dim row As DataRow = Results.NewRow
               row("url") = doc.Get("url")
               row("Title") = doc.Get("title")
               row("Description") = doc.Get("description")
               row("Published") = doc.Get("date")
              iCount = iCount + 1
        Return Results
     End Function
End Class

Now that you have a DataTable created with the results from your dotLucene search, you can bind it to a DataGrid for viewing on the Web or some other application. For example:

Dim searcher As New Searcher("c:\pathtoindex\")
Dim ds As New DataTable
Dim myQuery As String = myKeywords
ds = searcher.Search(myQuery, sort)

Where you pass the keywords or query to the Search method and the sort type. In our application the sort method parameter sent to dotLucene would be "date" for the date field we calculated earlier using the ReturnSortDate method used to format our date to a sortable integer.

About the Author, Patrick Santry

Patrick Santry, has two decades of experience in enabling businesses to take advantage of the digital landscape. A well rounded experience in technology, and business is what sets me apart from the rest of the pack. When it comes to an overall digital strategy my experience is impressive.

BS in Computer Information Systems. Four time recipient of the Microsoft MVP Award, and author of several books and magazine articles on digital technologies.

blog comments powered by Disqus


Thank you for visiting our site, before you leave, please visit some of the areas or information you may have missed.

Popular Articles