Integrating Mailman with a Swish-e search engine

From WPKG | Open Source Software Deployment and Distribution

Mailman is a very popular mailing list manager.

Unfortunately, Mailman does not provide a way to search its archives. Mailman can be integrated with Google search, but this method is discouraged: it normally takes several weeks until Googlebot crawls your new posts.

However, Mailman can be easily integrated with existing open-source indexing systems such as Swish-e. This HOWTO documents that setup.



Prerequisites

This HOWTO assumes:


Apache configuration

The Swish-e search script (swish.cgi) is written in Perl, while Mailman uses Python, so we probably need to tell Apache how to handle .cgi files. Your Apache needs mod_perl. To run Perl .cgi files, add +ExecCGI to the options of the Mailman directory.

Apache also needs to know that it has to parse some or all HTML files for Server Side Includes. Later, you will decide whether you want the search form on all Mailman pages or just on thread.html, subject.html, author.html and date.html:

Options -Indexes +FollowSymLinks +ExecCGI +Includes

### You can comment out XBitHack if you want a search form on all Mailman pages/messages
XBitHack on

###  Uncomment one of these directives if you want to have a search form on all pages
###  (Apache does not allow a comment on the same line as a directive)
###  For Apache 1.3:
# AddHandler server-parsed .html
###  For Apache 2.x:
# AddOutputFilter INCLUDES .html


See http://httpd.apache.org/docs/2.2/howto/ssi.html#configuring for more info on configuring Server Side Includes (SSI) in Apache.
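
With XBitHack on, Apache treats an HTML file as server-parsed only when its user-execute bit is set. The following is a minimal shell sketch for listing the files that currently qualify. The function name ssi_candidates is hypothetical, and the path in the usage comment is simply the archive path used elsewhere in this HOWTO.

```shell
# List .html files under a directory whose user-execute bit is set,
# i.e. the files Apache would parse for SSI when "XBitHack on" is active.
# ssi_candidates is a hypothetical helper name, not part of Apache or Mailman.
ssi_candidates() {
    find "$1" -type f -name '*.html' -perm -100
}

# Example (path assumed from this HOWTO):
# ssi_candidates /srv/www/vhosts/wpkg.org/mailman/archives/private
```

An empty result means no page will be parsed until the execute bit is set or the AddHandler/AddOutputFilter line is enabled.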


Swish-e configuration

Indexing configuration

Create a directory to hold the Swish-e index and its configuration:

# mkdir /srv/www/vhosts/wpkg.org/swish


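Then create the indexing configuration file, /srv/www/vhosts/wpkg.org/swish/lists.wpkg.org.config, with the following contents:

# Index file - this is what Swish will create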
IndexFile /srv/www/vhosts/wpkg.org/swish/lists.wpkg.org.index

# Root of our Mailman archives - everything under here will be indexed
IndexDir /srv/www/vhosts/wpkg.org/mailman/archives/public

# We want to index .html files only
IndexOnly .html

# Don't index summary pages: author.html, date.html etc.
FileRules filename is (author\.html|date\.html|index\.html|subject\.html|thread\.html)

# Replace local (physical) path with the web-accessible path
ReplaceRules replace "/srv/www/vhosts/wpkg.org/mailman/archives/public/" "pipermail/"

# Store description in search results
IndexContents HTML .html
StoreDescription HTML <pre> 200000

# Look at the title, too
MetaNames swishtitle

FollowSymLinks yes
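
To illustrate what the ReplaceRules line in this configuration does to the stored document paths, here is a rough shell sketch of the same substitution. It only mimics the prefix replacement and is not how Swish-e implements it. The function name to_web_path and the message path in the usage comment are hypothetical.

```shell
# Physical root of the public archives, as used in the configuration above.
ARCHIVE_ROOT="/srv/www/vhosts/wpkg.org/mailman/archives/public/"

# Swap the physical archive root for the web-visible "pipermail/" prefix.
# Paths outside the archive root are printed unchanged.
to_web_path() {
    case "$1" in
        "$ARCHIVE_ROOT"*) printf 'pipermail/%s\n' "${1#"$ARCHIVE_ROOT"}" ;;
        *) printf '%s\n' "$1" ;;
    esac
}

# Example with a made-up message path:
# to_web_path "${ARCHIVE_ROOT}wpkg-users/2007-November/000123.html"
```

The rewritten path is what ends up in the search results, so it has to match the URL under which your web server publishes the archives.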


Now build the index:

# swish-e -c /srv/www/vhosts/wpkg.org/swish/lists.wpkg.org.config
Indexing Data Source: "File-System"
Indexing "/srv/www/vhosts/wpkg.org/mailman/archives/public"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 10,755 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
10,755 unique words indexed.
5 properties sorted.
2,283 files indexed.  11,588,318 total bytes.  699,952 total words.
Elapsed time: 00:00:02 CPU time: 00:00:02
Indexing done!
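
Before setting up the web interface, it can be useful to query the new index from the command line. The sketch below wraps the swish-e options -f (index file), -w (query words) and -m (maximum number of results). The function name search_archive and the SWISH_BIN override are hypothetical conveniences, and the query word in the usage comment is only an example.

```shell
# Run a quick test query against an index.
#   $1 = index file, $2 = query, $3 = maximum number of results (default 5)
# SWISH_BIN may be set to use a swish-e binary that is not in PATH.
search_archive() {
    "${SWISH_BIN:-swish-e}" -f "$1" -w "$2" -m "${3:-5}"
}

# Example (index path from this HOWTO, query word chosen arbitrarily):
# search_archive /srv/www/vhosts/wpkg.org/swish/lists.wpkg.org.index wpkg 5
```

If the query returns results whose paths start with pipermail/, the index and the path rewriting are probably working.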

Web search configuration

Indexing is done. Now it's time to set up a search on your Mailman pages. Copy swish.cgi into Mailman's cgi-bin directory and make it executable:

# cp /usr/lib/swish-e/swish.cgi /srv/www/vhosts/wpkg.org/mailman/cgi-bin/
# chmod 755 /srv/www/vhosts/wpkg.org/mailman/cgi-bin/swish.cgi


Next, create the configuration file for swish.cgi, /srv/www/vhosts/wpkg.org/swish/swishcgi.conf:

return {
    title        => 'Search WPKG mailing lists',
    swish_binary => '/usr/bin/swish-e',
    swish_index  => '/srv/www/vhosts/wpkg.org/swish/lists.wpkg.org.index',

# I wanted to hide some fields I didn't use - compare it with the values in swish.cgi.
# Default values are commented out.

#   secondary_sort  => [qw/swishlastmodified desc/],
    secondary_sort  => [qw/swishtitle/],
#   sorts           => [qw/swishrank swishlastmodified swishtitle swishdocpath/],
    sorts           => [qw/swishrank swishtitle swishdocsize/],
#   metanames       => [qw/ swishdefault swishtitle swishdocpath /],
    metanames       => [qw/ swishdefault swishtitle /],
#   display_props   => [qw/swishlastmodified swishdocsize swishdocpath/],
    display_props   => [qw/swishdocsize/],
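}

I also hid the "Limit to" and date-range options on the search form by commenting out this code in swish.cgi:

--- swish.cgi.orig      2007-11-25 16:16:39.000000000 +0100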
+++ swish.cgi   2007-11-29 23:13:18.000000000 +0100
@@ -1679,14 +1679,14 @@
 
     # Set the layout:
 
-    my $string = '<br>Limit to: '
-                 . ( $fields{buttons} ? "$fields{buttons}<br>" : '' )
-                 . ( $fields{date_range_button} || '' )
-                 . ( $fields{date_range_low}
-                     ? " $fields{date_range_low} through $fields{date_range_high}"
-                     : '' );
-
-    return $string;
+#    my $string = '<br>Limit to: '
+#                 . ( $fields{buttons} ? "$fields{buttons}<br>" : '' )
+#                 . ( $fields{date_range_button} || '' )
+#                 . ( $fields{date_range_low}
+#                     ? " $fields{date_range_low} through $fields{date_range_high}"
+#                     : '' );
+#
+#    return $string;
 }

(Note: if you do not comment that code out and the date options still don't show up on the search page, you may be missing the Date::Calc module required by swish.cgi; see http://swish-e.org/docs/swish.cgi.html. You can test this from the command line with perl -e 'require Date::Calc', which should produce no output.)


Finally, point swish.cgi at its configuration file by changing this line in swish.cgi:

my $DEFAULT_CONFIG_FILE = '/srv/www/vhosts/wpkg.org/swish/swishcgi.conf';


Integrating the search with Mailman's pages

If search works, congratulations. Now it's time to integrate the search form with some of Mailman's pages. We will do this with a simple Server Side Include (SSI): <!--#include virtual="/swish_mm.cgi" --> added to the Mailman pages. Note that the include refers to swish_mm.cgi rather than swish.cgi. There is a reason for that.

swish.cgi generates a complete HTML page, including the <html>, <body> and related tags. Because Mailman's pages already contain these tags, we have to make sure that swish.cgi does not add them again.

Copy swish.cgi to swish_mm.cgi and make these changes:

--- swish.cgi   2007-11-29 23:19:45.000000000 +0100
+++ swish_mm.cgi        2007-11-29 23:33:07.000000000 +0100
@@ -451,7 +451,7 @@
         # TemplateDefault is the default
 
         xtemplate => {
-            package     => 'SWISH::TemplateDefault',
+            package     => 'SWISH::TemplateDefault_MM',
         },
 
         xtemplate => {
@@ -770,7 +770,7 @@
 
 
     # load the templating module
-    my $template = $conf->{template} || { package => 'SWISH::TemplateDefault' };
+    my $template = $conf->{template} || { package => 'SWISH::TemplateDefault_MM' };
     load_module( $template->{package} );


Then back up the default template module and create a copy to modify:

# cp /usr/lib/swish-e/perl/SWISH/TemplateDefault.pm /usr/lib/swish-e/perl/SWISH/TemplateDefault.pm.orig
# cp /usr/lib/swish-e/perl/SWISH/TemplateDefault.pm /usr/lib/swish-e/perl/SWISH/TemplateDefault_MM.pm


Make these changes in TemplateDefault_MM.pm:

--- TemplateDefault.pm  2005-06-19 00:52:52.000000000 +0200
+++ TemplateDefault_MM.pm       2007-11-29 23:41:14.000000000 +0100
@@ -2,7 +2,7 @@
 # These routines format the HTML output.
 #    $Id: TemplateDefault.pm,v 1.3 2003/05/13 06:11:33 whmoseley Exp $
 #=====================================================================
-package SWISH::TemplateDefault;
+package SWISH::TemplateDefault_MM;
 use strict;
 
 use CGI;
@@ -63,14 +63,14 @@
                : $results->config('logo') || $default_logo;
 
     return <<EOF;
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<!-- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
     <head>
        <title>
           $html_title
        </title>
     </head>
-    <body>
+    <body>-->
         <h2>
         $logo$title $message
         </h2>
@@ -124,7 +124,7 @@
 
 
     return <<EOF;
-    <form method="get" action="$form" enctype="application/x-www-form-urlencoded" class="form">
+    <form method="get" action="/swish.cgi" enctype="application/x-www-form-urlencoded" class="form">
         <input maxlength="200" value="$query" size="32" type="text" name="query"/>
         $hidden
         <input value="Search!" type="submit" name="submit"/><br>
@@ -337,11 +337,11 @@
 
     $links
     <hr>
-    <small>Powered by <em>Swish-e</em> <a href="http://swish-e.org">swish-e.org</a></small>
+<!--    <small>Powered by <em>Swish-e</em> <a href="http://swish-e.org">swish-e.org</a></small>
     $mod_perl
     $valid_html_logo
     </body>
-</html>
+</html>-->
 EOF
 }

Mailman configuration

Now it's time to edit the Mailman template files so that Mailman pages include a search form. If you just want a search form on thread.html, subject.html, author.html and date.html, add <!--#include virtual="/swish_mm.cgi" --> to three Mailman templates: archidxhead.html, archtoc.html and archtocnombox.html.

If you use Mailman's default English language, you will find these files in the templates/en directory of your Mailman installation. It is very important that you do NOT edit the templates in MAILMANDIR/templates/en, because you would lose your changes when you upgrade Mailman. Instead, create the directory MAILMANDIR/templates/site/en, copy the templates you want to change into it, and edit those site copies.

The change is simple. An example is shown below:

--- archidxhead.html.orig       2007-11-29 23:51:43.000000000 +0100
+++ archidxhead.html    2007-11-29 00:15:03.000000000 +0100
@@ -8,6 +8,7 @@
   <BODY BGCOLOR="#ffffff">
       <a name="start"></A>
       <h1>%(archive)s Archives by %(archtype)s</h1>
+   <!--#include virtual="/swish_mm.cgi" -->
       <ul>
          <li> <b>Messages sorted by:</b>
                %(thread_ref)s

If you also want a search form on every archived message page, make a similar change in article.html.

Once you have made the changes to the templates, you MUST restart the Mailman process, since ArchRunner keeps a cache of the templates.
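
The site-template override described above can be sketched in shell as follows. This is only an illustration: the function name copy_site_templates is hypothetical, and the Mailman directory in the usage comment is assumed from the paths used elsewhere in this HOWTO.

```shell
# Copy the three archive templates into the site override directory so that
# edits survive a Mailman upgrade. Existing site copies are left untouched.
#   $1 = Mailman installation directory (MAILMANDIR)
copy_site_templates() {
    src="$1/templates/en"
    dst="$1/templates/site/en"
    mkdir -p "$dst" || return 1
    for t in archidxhead.html archtoc.html archtocnombox.html; do
        # Skip templates that already have a site copy.
        [ -e "$dst/$t" ] || cp "$src/$t" "$dst/$t" || return 1
    done
}

# Example (installation directory assumed from this HOWTO):
# copy_site_templates /srv/www/vhosts/wpkg.org/mailman
```

Because existing files are skipped, running it again does not overwrite site copies you have already edited.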

Recreating Mailman's archive

If you already have a list archive, you will need to recreate it to apply all these changes. To do this, you need the mbox file that Mailman creates for the list. An example is shown below:

# /srv/www/vhosts/wpkg.org/mailman/bin/arch --wipe wpkg-users wpkg-users.mbox


If you executed the above command as root, make sure to restore proper permissions:

# chown -R mailman:mailman /srv/www/vhosts/wpkg.org/mailman/archives
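
If you are not sure where the mbox file is, the sketch below wraps the arch command shown above and checks for the file first. It assumes the usual Mailman 2.1 layout, in which each list's mbox is stored as archives/private/LIST.mbox/LIST.mbox. Verify this on your installation before relying on it. The function name rebuild_archive is hypothetical.

```shell
# Rebuild the HTML archive of one list from its mbox file.
#   $1 = Mailman installation directory, $2 = list name
rebuild_archive() {
    mailmandir=$1
    list=$2
    # Assumption: default Mailman 2.1 mbox location; adjust if yours differs.
    mbox="$mailmandir/archives/private/$list.mbox/$list.mbox"
    if [ ! -f "$mbox" ]; then
        echo "no mbox found for $list: $mbox" >&2
        return 1
    fi
    "$mailmandir/bin/arch" --wipe "$list" "$mbox"
}

# Example (directory and list name taken from this HOWTO):
# rebuild_archive /srv/www/vhosts/wpkg.org/mailman wpkg-users
```

As with the manual command, fix ownership afterwards if you ran it as root.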


That's it! Now check if search is integrated with your Mailman pages.

Adding crontab entries

You will want to reindex your archive periodically. Also, if you only want the search form on the thread.html, subject.html, author.html and date.html pages, you have to set the execute bit on them.

How often you do it will depend on the size of your list and the traffic it gets.

I run these two commands every hour (note: these are not crontab entries, just the commands you need to run from cron):

# Crawl the archive
swish-e -c /srv/www/vhosts/wpkg.org/swish/lists.wpkg.org.config >/dev/null 2>&1

# If you use "XBitHack on", Apache should parse only these files
find /srv/www/vhosts/wpkg.org/mailman/archives/private -name thread.html -or -name index.html \
     -or -name date.html -or -name subject.html -or -name author.html | xargs chmod 755
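
The chmod step above can also be written without xargs. The variant below is a sketch rather than a drop-in replacement: it groups the -name tests in parentheses, restricts matches to regular files, and uses -exec ... + so that unusual file names are passed to chmod intact. The function name mark_summary_pages is hypothetical.

```shell
# Set the execute bit on Mailman's summary pages so that Apache parses them
# for SSI when "XBitHack on" is active. Individual message pages are left alone.
#   $1 = root of the archive tree
mark_summary_pages() {
    find "$1" -type f \( -name thread.html -o -name index.html \
        -o -name date.html -o -name subject.html -o -name author.html \) \
        -exec chmod 755 {} +
}

# Example (path from this HOWTO):
# mark_summary_pages /srv/www/vhosts/wpkg.org/mailman/archives/private
```

Running it after each reindex picks up summary pages that Mailman has created or regenerated since the last run.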

You will also probably need to add entries like these to Mailman's default cron file:

HOME=/tmp
MAILTO=your@email

Without HOME, it didn't work with my cron.
