Monday, May 31, 2010

Alfresco: Loading Tag Categories

In my previous post, I described my content model. In this model, each post could be manually classified against one or more tags (similar to how one would do it on Blogger). The tags are stored as part of Alfresco's taxonomy and shared between bloggers (they could have been private to each blogger, but since the idea of tagging is to build a shared folksonomy, I thought it would be better to have them shared).

I pulled three Atom feeds for my example - one from my own blog, and two others from friends who also write on Blogger. Then I parsed the categories out of the feeds and wrote a de-duplicated set of tags from all three blogs into a flat file. To parse the feeds, I used StAX - here is the code.

// Source: src/java/com/mycompany/alfresco/extension/loaders/CategoryParser.java
package com.mycompany.alfresco.extension.loaders;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.junit.Test;

/**
 * Parses a Blogger Atom feed and collects the distinct category terms.
 */
public class CategoryParser {

  public void parse(String author, Set<String> cats) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(
      new FileInputStream("/Users/sujit/Projects/Alfresco/" + 
      author + "_atom.xml"));
    for (;;) {
      int evt = parser.next();
      if (evt == XMLStreamConstants.END_DOCUMENT) {
        break;
      }
      if (evt == XMLStreamConstants.START_ELEMENT) {
        String tag = parser.getName().getLocalPart();
        if ("category".equals(tag)) {
          int nattrs = parser.getAttributeCount();
          for (int i = 0; i < nattrs; i++) {
            String attrname = parser.getAttributeLocalName(i);
            if ("term".equals(attrname)) {
              cats.add(parser.getAttributeValue(i));
            }
          }
        }
      }
    }
    parser.close();
  }
  
  @Test
  public void testParse() throws Exception {
    PrintWriter writer = new PrintWriter(
      new FileWriter(new File("/tmp/cats.txt")));
    Set<String> cats = new HashSet<String>();
    parse("happy", cats);
    parse("grumpy", cats);
    parse("bashful", cats);
    for (String cat : cats) {
      writer.println(cat);
    }
    writer.flush();
    writer.close();
  }
}

One thing that I've started doing recently is to embed the @Test method in the main class itself, similar to how some people put in a main() method for testing. This is particularly useful if all you want the test to do is run your class. That way, you can use a single Ant target to run all your classes, instead of having a specific target for each class. Here is the unittest target.

  <target name="unittest" depends="setup,compile" description="run unit test">
    <junit printSummary="yes" haltonerror="true" haltonfailure="true" fork="true" dir=".">
      <test name="${test.class}" todir="./bin"/>
      <classpath refid="classpath.server"/>
      <classpath refid="classpath.build"/>
      <classpath path="${alfresco.web.dir}/WEB-INF/classes"/>
      <classpath path="bin"/>
      <sysproperty key="basedir" value="."/>
      <sysproperty key="dir.root" value="/Users/sujit/Library/Tomcat/alf_data"/>
      <formatter type="plain" usefile="false"/>
    </junit>
  </target>
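Since the target reads the class name from the test.class property, running any of these embedded tests is a one-liner, e.g. ant unittest -Dtest.class=com.mycompany.alfresco.extension.loaders.CategoryParser.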

Inserting the categories into Alfresco proved a bit trickier. There was no code example that I could find, either in Jeff Potts's book or on the web.

One way to do this is to manually add your categories to alfresco/bootstrap/categories.xml, reinitialize the database and data directories, then start up the Alfresco web application. I suppose I could have done this, but it seemed like overkill to me.

The next hint I found was in the Classification and Categories wiki page, which states:

To add categories to the cm:generalclassifiable classification, there first needs to be a node of type cm:category with a child association QName of cm:generalclassifiable and child association type cm:categories beneath a node of type cm:category_root. This node is the top of the classification.

Nodes can be created beneath this node of type cm:category and child association type cm:subcategories. These nodes defined the root categories for the classification. Further nodes of type cm:category and child association type cm:subcategories can be added beneath these nodes to define the category hierarchy. Secondary links can be used to include categories from one classification in another - these category nodes appear in both classifications. The category property and its defining aspect determines which classification applies.

Pretty simple, right? Yeah, I thought so too :-). But it does make sense if you read this really carefully, at the same time referring to the categories.xml file.

Towards the bottom of the categories.xml file is an empty subcategory of the root category, called "Tags". Presumably, this is the category that should be customized by applications. So I decided to hang my "my:tag" category node off this, and put all my categories as child subcategories of it. That way, I could add more application-specific categories as siblings of the my:tag category. Something like this:

cm:category_root
  |
  +-- ...
  |
  +-- Tags
  |    |
  |    +-- MyCompany Post Tags (my:tag)
  |    |    |
  |    |    +-- xmlrpc
  |    |    |
  |    |    +-- ...

The wiki page said to look at the unit tests to see how the above should be done - the one I found was ADMLuceneCategoryTest. It wasn't exactly what I was looking for, but it did give me some useful pointers on how to go about doing this. Based on the code in there, I decided to use the Alfresco Foundation API.

The one disadvantage of using the Foundation API is that it takes a while for the ApplicationContext to spin up. An advantage, though, is that you can do this without the web application running (in fact, with the application running, it complained about port 50501 already being in use - but I believe that is something Mac OS specific). It would probably have been quicker to use one of the remote APIs, but since I plan on using those anyway once I build the client for the CMS users, I decided to stick with the Foundation API for now. Here is the code to load the categories from the flat file generated in the previous step.

// Source: src/java/com/mycompany/alfresco/extension/loaders/CategoryLoader.java
package com.mycompany.alfresco.extension.loaders;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Serializable;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import javax.transaction.UserTransaction;

import org.alfresco.model.ContentModel;
import org.alfresco.service.ServiceRegistry;
import org.alfresco.service.cmr.repository.ChildAssociationRef;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.CategoryService;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchService;
import org.alfresco.service.cmr.security.AuthenticationService;
import org.alfresco.service.namespace.QName;
import org.alfresco.service.transaction.TransactionService;
import org.alfresco.util.ApplicationContextHelper;
import org.junit.Assert;
import org.junit.Test;
import org.springframework.context.ApplicationContext;

/**
 * Loads the category tags parsed from the Atom feeds into Alfresco
 * under Tags/MyCompany Post Tags, using the Foundation API. Also
 * contains helpers to delete and verify the loaded categories.
 */
public class CategoryLoader {

  private static final String MYCOMPANY_POST_TAG_QUERY = 
    "PATH:\"cm:generalclassifiable/cm:Tags/" + 
    "cm:MyCompany_x0020_Post_x0020_Tags\"";

  private ApplicationContext ctx;
  
  public CategoryLoader() {
    this.ctx = ApplicationContextHelper.getApplicationContext();
  }
  
  public int loadCategories(String categoryFile) throws Exception {
    int numLoaded = 0;
    ServiceRegistry serviceRegistry = 
      (ServiceRegistry) ctx.getBean(ServiceRegistry.SERVICE_REGISTRY);
    CategoryService categoryService = serviceRegistry.getCategoryService();
    AuthenticationService authenticationService = 
      serviceRegistry.getAuthenticationService();
    authenticationService.authenticate("admin", "admin".toCharArray());
    TransactionService txService = serviceRegistry.getTransactionService();
    UserTransaction tx = txService.getUserTransaction();
    tx.begin();
    Collection<ChildAssociationRef> refs = categoryService.getRootCategories(
      StoreRef.STORE_REF_WORKSPACE_SPACESSTORE, 
      ContentModel.ASPECT_GEN_CLASSIFIABLE);
    NodeRef tagCategoryRef = null;
    for (ChildAssociationRef ref : refs) {
      if (ref.getQName().equals(ContentModel.PROP_TAGS)) {
        tagCategoryRef = ref.getChildRef();
        break;
      }
    }
    try {
      SearchService searchService = serviceRegistry.getSearchService();
      ResultSet resultSet = null;
      BufferedReader reader = null;
      try {
        resultSet = searchService.query(
          StoreRef.STORE_REF_WORKSPACE_SPACESSTORE, 
          SearchService.LANGUAGE_LUCENE, MYCOMPANY_POST_TAG_QUERY);
        NodeRef myPostTagsRef = null;
        if (resultSet.getChildAssocRefs().size() > 0) {
          myPostTagsRef = resultSet.getChildAssocRef(0).getChildRef();
        } else {
          myPostTagsRef = categoryService.createCategory(
            tagCategoryRef, "MyCompany Post Tags");
        }
        reader = new BufferedReader(new FileReader(categoryFile));
        String category = null;
        while ((category = reader.readLine()) != null) {
          System.out.println("Adding category: " + category);
          categoryService.createCategory(myPostTagsRef, category);
          numLoaded++;
        }
      } finally {
        if (resultSet != null) { resultSet.close(); }
        if (reader != null) { reader.close(); }
      }
      tx.commit();
    } catch (Exception e) {
      tx.rollback();
      throw e;
    }
    return numLoaded;
  }

  public void deleteMyCompanyTags() throws Exception {
    ServiceRegistry serviceRegistry = 
      (ServiceRegistry) ctx.getBean(ServiceRegistry.SERVICE_REGISTRY);
    AuthenticationService authenticationService = 
      serviceRegistry.getAuthenticationService();
    authenticationService.authenticate("admin", "admin".toCharArray());
    String ticket = authenticationService.getCurrentTicket();
    TransactionService txService = serviceRegistry.getTransactionService();
    UserTransaction tx = txService.getUserTransaction();
    tx.begin();
    SearchService searchService = serviceRegistry.getSearchService();
    ResultSet resultSet = null;
    try {
      resultSet = searchService.query(
        StoreRef.STORE_REF_WORKSPACE_SPACESSTORE, 
        SearchService.LANGUAGE_LUCENE, MYCOMPANY_POST_TAG_QUERY);
      NodeRef myCompanyTagsRef = resultSet.getChildAssocRef(0).getChildRef();
      NodeService nodeService = serviceRegistry.getNodeService();
      CategoryService categoryService = serviceRegistry.getCategoryService();
      for (ChildAssociationRef caref : 
          nodeService.getChildAssocs(myCompanyTagsRef)) {
        categoryService.deleteCategory(caref.getChildRef());
      }
      // commit inside the try so a failure triggers the rollback below
      tx.commit();
    } catch (Exception e) {
      tx.rollback();
      throw e;
    } finally {
      if (resultSet != null) { resultSet.close(); }
    }
    authenticationService.invalidateTicket(ticket);
    authenticationService.clearCurrentSecurityContext();
  }
  
  public int verifyLoading() throws Exception {
    int numVerified = 0;
    ServiceRegistry serviceRegistry = (ServiceRegistry) ctx.getBean(
      ServiceRegistry.SERVICE_REGISTRY);
    AuthenticationService authenticationService = 
      serviceRegistry.getAuthenticationService();
    authenticationService.authenticate("admin", "admin".toCharArray());
    TransactionService txService = serviceRegistry.getTransactionService();
    UserTransaction tx = txService.getUserTransaction();
    tx.begin();
    try {
      NodeService nodeService = serviceRegistry.getNodeService();
      SearchService searchService = serviceRegistry.getSearchService();
      // find all nodes that are under our category folder
      ResultSet resultSet = null;
      try {
        resultSet = searchService.query(
          StoreRef.STORE_REF_WORKSPACE_SPACESSTORE, 
          SearchService.LANGUAGE_LUCENE, MYCOMPANY_POST_TAG_QUERY);
        NodeRef myCompanyTagsRef = resultSet.getChildAssocRef(0).getChildRef();
        List<ChildAssociationRef> carefs = 
          nodeService.getChildAssocs(myCompanyTagsRef);
        for (ChildAssociationRef caref : carefs) {
          NodeRef catRef = caref.getChildRef();
          Map<QName,Serializable> props = nodeService.getProperties(catRef);
          String name = (String) props.get(ContentModel.PROP_NAME);
          System.out.println("Verified: " + name);
          numVerified++;
        }
      } finally {
        if (resultSet != null) { resultSet.close(); }
      }
      tx.commit();
    } catch (Exception e) {
      tx.rollback();
      throw e;
    }
    return numVerified;
  }

  @Test
  public void testLoadCategories() throws Exception {
    CategoryLoader loader = new CategoryLoader();
    int loaded = loader.loadCategories(
      "/Users/sujit/Projects/Alfresco/cats.txt");
    int verified = loader.verifyLoading();
    Assert.assertEquals(loaded, verified);
  }
}
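The loader runs through the same unittest Ant target shown earlier, with test.class set to com.mycompany.alfresco.extension.loaders.CategoryLoader. One caveat (see the 2010-06-04 update below): the dir.root system property must point to the live alf_data directory, otherwise the Foundation API will quietly create and populate a fresh one under your project directory.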

After running this, I can verify on the Alfresco webapp's Admin console that the categories made it in. The left panel shows the category hierarchy, while the right panel shows an icon view of the categories that were just inserted. The breadcrumb above also shows the relative position of these category tag elements.

Categories are to Alfresco what Taxonomies are to Drupal. Once you understand how to load categories, they are fairly simple to work with. You can also nest categories to any depth (see categories.xml for how to do this).
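As an illustration, here is a minimal sketch (mine, not code from the loader above) of how nesting would look, assuming a CategoryService and a parent NodeRef obtained the same way as in CategoryLoader:

// Minimal sketch, assuming categoryService and tagCategoryRef are
// obtained as in CategoryLoader above. Each createCategory() call
// creates a cm:category node under the given parent, so chaining
// calls yields a nested hierarchy (Tags > java > lucene).
NodeRef javaRef = categoryService.createCategory(tagCategoryRef, "java");
NodeRef luceneRef = categoryService.createCategory(javaRef, "lucene");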

For comparison, Drupal does this with 4 database tables - 3 to define the taxonomy and the terms themselves, and a fourth to map nodes to terms. Drupal's implementation has its own warts (there are 3 different ways the node->taxonomy element can be structured, depending on the type of taxonomy being used), but its approach seems simpler and more intuitive to me.

That said, one thing I do like about Alfresco is its unified treatment of categories and content - both are nodes.

Update - 2010-06-04

I had a bug in the loading code - I forgot to add the aspect to the category node as I was loading it. I also added a verification step to the loader that runs against Alfresco's Lucene index once the loading is complete, and verifies that the number of categories in the input file is the same as in Alfresco. The code has been updated in the main post.

The second thing I noticed was that a new alf_data directory was being created in my project - this was because dir.root was set to ./alf_data. I guess the reason I could still see the data in Alfresco's web client was because it comes from the database. I updated dir.root directly in repository.properties and added it as a system property in my Ant task. The XML for the Ant task has also been updated in the main post.
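For reference, the override in repository.properties is just this one line (with my local path):

dir.root=/Users/sujit/Library/Tomcat/alf_data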

Update - 2010-06-12

When trying to link a post to a category, I found that I had marked the categories with the my:tagClassifiable aspect. No aspect should be applied to the category - it should be applied to the my:post content instead. The CategoryLoader code has been updated with this information. In addition, there is now code to delete all the categories (since I had to do this before I reran).

Wednesday, May 26, 2010

Alfresco: Developing the Content Model

In a traditional system where you start building from scratch, you can generally break up a system into individual components once you have a reasonable idea of the way they are going to fit together, and then proceed with the development of each component in relative isolation. With a packaged system which you are trying to extend for your own purposes, I find that the development cycle is slightly different - you first have to figure out how the system works, and fit your application around it so you don't paint yourself into a corner.

In order to quickly get through the "learn how the system works" stage, I recommend reading both Munwar Sharif's Alfresco Enterprise Content Management Implementation and Jeff Potts's Alfresco Developer's Guide, preferably in that order (although I read them in reverse order). I found both books enormously helpful and informative. Of course, there is still no guarantee that I won't paint myself into a corner, but hopefully the chances are lower :-).

The Content Model

Alfresco is at its core a Document Management (DM) system; its main storage unit is a file - you upload a file and either enter the metadata or build/use extractors that pull the metadata out of it. Contrast this with something like Drupal, which is really a Web Content Management (WCM) system - it provides forms to enter both content and metadata. Depending on your point of view, this could be an advantage or a disadvantage.

My projected use case is for a WCM, and I needed content objects to store user-entered metadata for blog posts, so I had to build my own model. Here is what I came up with.

XML Definition Files

The custom content model is defined using Spring configuration and Alfresco's own content model definition XML. First we create a Spring configuration file in the config/alfresco/extension directory, called mycompany-model-context.xml. When deployed, the contents of the config/alfresco/extension directory end up under WEB-INF/classes/alfresco/extension in the alfresco webapp, where Alfresco looks for files matching the pattern *-context.xml.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: config/alfresco/extension/mycompany-model-context.xml -->
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 
    'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
  <bean id="mycompany.dictionaryBootstrap" 
      parent="dictionaryModelBootstrap" 
      depends-on="dictionaryBootstrap">
    <property name="models">
      <list>
        <value>alfresco/extension/model/myModel.xml</value>
      </list>
    </property>
  </bean>
</beans>

As you can see, we define a model file here under the "models" property. This model file contains our custom model - basically the diagram above written out in XML.

There is a lot to explain here. However, this seems to be one of the first things people try to do with Alfresco, so there are lots of blog posts and wiki pages that do so already. I will point out some things, but for the rest, you may want to take a look at the Alfresco Data Dictionary Guide referenced below.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: config/alfresco/extension/model/myModel.xml -->
<model name="my:mymodel" xmlns="http://www.alfresco.org/model/dictionary/1.0">

  <description>My Model</description>
  <author>Sujit Pal</author>
  <version>1.0</version>

  <!-- import base models -->
  <imports>
    <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d"/>
    <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm"/>
  </imports>

  <namespaces>
    <namespace uri="http://www.mycompany.com/model/content/1.0" prefix="my"/>
  </namespaces>

  <constraints>
    <constraint name="my:pubStates" type="LIST">
      <parameter name="allowedValues">
        <list>
          <value>Draft</value>
          <value>Review</value>
          <value>Published</value>
        </list>
      </parameter>
    </constraint>
  </constraints>

  <types>
    <!-- ==================================================== -->
    <!-- Represents the base document for this application    -->
    <!-- ==================================================== -->
    <type name="my:baseDoc">
      <title>Base Document</title>
      <description>Abstract Base Document for this application</description>
      <parent>cm:content</parent>
      <mandatory-aspects>
        <aspect>cm:ownable</aspect>
      </mandatory-aspects>
    </type>
    <!-- ==================================================== -->
    <!-- Represents a blog written by a user (represented by  -->
    <!-- a my:profile content). This has a 1:1 mapping to a   -->
    <!-- profile, ie, a blog must have an associated profile  -->
    <!-- It has a 0:n mapping to posts, ie, a blog can have 0 -->
    <!-- to n posts associated with it. Fields for it are     -->
    <!-- title, description and creation date.                --> 
    <!-- ==================================================== -->
    <type name="my:blog">
      <title>Blog Information</title>
      <description>Blog Level Information</description>
      <parent>my:baseDoc</parent>
      <properties>
        <property name="my:blogname">
          <type>d:text</type>
        </property>
        <property name="my:byline">
          <type>d:text</type>
        </property>
        <property name="my:user">
          <type>d:noderef</type>
        </property>
      </properties>
      <associations>
        <child-association name="my:posts">
          <title>Posts</title>
          <target>
            <class>my:post</class>
            <mandatory>false</mandatory>
            <many>true</many>
          </target>
        </child-association>
      </associations>
    </type>
    <!-- ==================================================== -->
    <!-- Represents a single blog post.                       -->
    <!-- ==================================================== -->
    <type name="my:post">
      <title>Blog Post</title>
      <description>Single Blog Post</description>
      <parent>my:baseDoc</parent>
      <properties>
        <property name="my:blogRef">
          <type>d:noderef</type>
        </property>
      </properties>
      <mandatory-aspects>
        <aspect>cm:titled</aspect>
        <aspect>my:tagClassifiable</aspect>
        <aspect>my:publishable</aspect>
      </mandatory-aspects>
    </type>

  </types>

  <aspects>

    <aspect name="my:publishable">
      <title>Publishable</title>
      <properties>
        <property name="my:pubState">
          <type>d:text</type>
          <multiple>true</multiple>
          <constraints>
            <constraint ref="my:pubStates"/>
          </constraints>
        </property>
        <property name="my:pubDttm">
          <type>d:datetime</type>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>true</stored>
            <tokenised>both</tokenised>
          </index>
        </property>
        <property name="my:unpubDttm">
          <type>d:datetime</type>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>true</stored>
            <tokenised>both</tokenised>
          </index>
        </property>
        <property name="my:furl">
          <type>d:text</type>
        </property>
      </properties>
    </aspect>

    <aspect name="my:tagClassifiable">
      <title>Category Tag</title>
      <parent>cm:classifiable</parent>
      <properties>
        <property name="my:tags">
          <title>Tags</title>
          <type>d:category</type>
          <mandatory>false</mandatory>
          <multiple>true</multiple>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>true</stored>
            <tokenised>false</tokenised>
          </index>
        </property>
      </properties>
    </aspect>
  </aspects>

</model>

In my model, the central content type is my:post. It is publishable - the my:publishable aspect carries the properties necessary for specifying its current workflow state. It has an n:1 relation with my:blog, which is a container for a set of my:post objects. A my:post can also have a set of tags, which is modeled as an aspect (I copied this from the Classification and Categories page referenced below).
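To make the model concrete, here is a hedged sketch (not code from this post) of what creating a my:post node might look like through the Foundation API. It assumes nodeService comes from a ServiceRegistry and parentFolder is the NodeRef of the target space; the namespace URI is the one declared in myModel.xml above.

// Hypothetical sketch: creating a my:post node via NodeService.
// Assumes nodeService (from a ServiceRegistry) and parentFolder
// (NodeRef of the target space) are already available.
String MY_NS = "http://www.mycompany.com/model/content/1.0";
Map<QName,Serializable> props = new HashMap<QName,Serializable>();
props.put(ContentModel.PROP_NAME, "my-first-post");
props.put(QName.createQName(MY_NS, "furl"), "my-first-post");
NodeRef postRef = nodeService.createNode(
  parentFolder,                              // parent space
  ContentModel.ASSOC_CONTAINS,               // containment association
  QName.createQName(MY_NS, "my-first-post"), // association QName
  QName.createQName(MY_NS, "post"),          // node type my:post
  props).getChildRef();
// mandatory aspects declared in the model (cm:titled,
// my:tagClassifiable, my:publishable) are applied automatically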

To test the model, you will need to run "ant deploy" and restart Tomcat. The "ant deploy" target basically zips up the project and unzips it into the exploded Alfresco war file on Tomcat. Jeff Potts has made the code for his book, which contains the build.xml, available for direct download if you want it.

In addition, there is a quicker way (suggested by the Data Dictionary Guide wiki page), which just calls a Java class in the Alfresco JAR files. The Ant snippet is shown below. You still need to deploy and restart Tomcat once you are happy with your model, but this is good for iterative testing.

  <target name="test-model" depends="setup" description="check model">
    <java dir="." fork="true" 
        classname="org.alfresco.repo.dictionary.TestModel">
      <classpath refid="classpath.server"/>
      <classpath refid="classpath.build"/>
      <classpath path="${alfresco.web.dir}/WEB-INF/classes"/>
      <classpath path="config"/>
      <arg line="alfresco/extension/model/myModel.xml"/>
    </java>
  </target>

Verifying the changes

You are not done yet, though... At this point, you still won't actually "see" your custom model in the Alfresco UI. For that, you will need to set up the property sheets for each of the content types. This is done in yet another XML file in the config/alfresco/extension directory, called web-client-config-custom.xml (this file name is hardcoded in Alfresco's config, so it looks for overrides here). Here is the file, without too much explanation - basically you are telling the UI to show certain fields for certain content types.

<alfresco-config>
<!-- Source: config/alfresco/extension/web-client-config-custom.xml -->

  <config evaluator="aspect-name" condition="my:tagClassifiable">
    <property-sheet>
      <show-property name="my:tags" display-label-id="tags"/>
    </property-sheet>
  </config>

  <config evaluator="node-type" condition="my:blog">
    <property-sheet>
      <show-property name="my:blogname" display-label-id="blogname"/>
      <show-property name="my:byline" display-label-id="byline"/>
      <show-property name="my:user" display-label-id="user"/>
      <show-child-association name="my:posts"/>
    </property-sheet>
  </config>

  <config evaluator="node-type" condition="my:post">
    <property-sheet>
      <show-property name="my:pubState" display-label-id="pubState"/>
      <show-property name="my:pubDttm" display-label-id="pubDttm"/>
      <show-property name="my:unpubDttm" display-label-id="unpubDttm"/>
      <show-property name="my:furl" display-label-id="furl" read-only="true"/>
    </property-sheet>
  </config>

  <config evaluator="string-compare" condition="Content Wizards">
    <content-types>
      <type name="my:blog" display-label-id="blog"/>
      <type name="my:post" display-label-id="post"/>
    </content-types>
  </config>

  <config evaluator="string-compare" condition="Action Wizards">
    <aspects>
      <aspect name="my:tagClassifiable"/>
      <aspect name="my:publishable"/>
    </aspects>
    <subtypes>
      <type name="my:baseDoc"/>
      <type name="my:blog"/>
      <type name="my:post"/>
    </subtypes>
    <specialize-types>
      <type name="my:post"/>
    </specialize-types>
  </config>

</alfresco-config>

There is also an associated property file, which provides human-readable names for the display-label-id attributes in the file above. It looks like this:

# Source: config/alfresco/extension/webclient.properties

#my:tagClassifiable
tags=Category Tags

#my:blog
blogname=Blog Name
byline=Byline
user=User Information

#my:post
pubState=Current Workflow State
pubDttm=Scheduled Publish Date
unpubDttm=Scheduled Unpublish Date
furl=Consumer Friendly URL

# content wizard
blog=Blog
post=Post

Once you run "ant deploy" and restart Tomcat, you will see your custom content types if you try to Add Content. On the second page, it prompts for Type, and I can see Blog and Post in the dropdown. Here is the screenshot for this page, although I could not grab a screenshot with the dropdown opened.

Useful Links

Here are some Wiki Entries I found useful, along with the two books mentioned above.

Conclusion

Compared to Drupal, custom content model creation in Alfresco is a lot harder. Some of it probably has to do with the fact that Alfresco is written in Java, and servlet containers need to be restarted for changes to take effect, unlike LAMP systems. Also, the use of XML for configuration is more common in Java.

Alfresco's model, however, is better when it comes to deployment of the same model across multiple environments, a fairly typical scenario in most Java (and probably PHP/Drupal) shops. Because configuration is just a set of files, they can be developed once and applied to multiple environments.

Update - 2010-06-12

When uploading the blog and post data from my Atom dumps, I discovered that I had made several goof-ups when designing the content model. Specifically, these were:

  • Using d:path for the friendly URL in my:post - The d:path value contains the Alfresco-specific unique URI, not the web-friendly URL I was looking for. So I changed the name to furl and the type to d:text.
  • 1:n association from my:blog to my:post - Not totally sure about this, but I seem to have made a mistake in configuring it. So I followed the example on this wiki page to define the 1:n relation by removing the source declaration. The fix works (in the sense that Alfresco does not complain), but I am still unable to use nodeService.createAssociation() to link a my:post to a my:blog. I need to investigate this further.

I have updated all the relevant bits of XML and properties files in this post for the above.

Update - 2010-06-25

I wanted to extend my model with some more types, and realized that having my:post extend my:publishableDoc was a bit restrictive, so decided to make an aspect my:publishable instead, which I then applied to my:post. I have updated the model XML file above.

One thing I noticed is that aspect properties are not indexed by default. I guess it's because an aspect is like an interface, meant to be applied as a marker during search. So if you want to search on an aspect property, you will need to make it indexable.
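The relevant snippet from myModel.xml above - the index element is what makes the aspect property searchable:

        <property name="my:pubDttm">
          <type>d:datetime</type>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>true</stored>
            <tokenised>both</tokenised>
          </index>
        </property>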

Monday, May 17, 2010

Alfresco: Installation and Initial Thoughts

Currently, we are on the final stretch of delivering a product that combines the Drupal CMS with a Java-based publishing system. It has been a long and painful process getting to this point. One of the things that contributed to the pain is the opacity of the Drupal code, compounded by the fact that we don't have too much Drupal/PHP talent in-house. So at one point, I wondered if the process wouldn't have been easier had we chosen to work with a Java based CMS instead.

One well-known and fairly mature open-source Java based CMS is Alfresco. I decided to check it out and use it to build something similar to our Drupal based product. What's the objective of this apparently pointless exercise, you ask? Well, it's mainly to learn about Alfresco and see how it compares to Drupal - really just curiosity. And no, it's not to be able to switch out the Drupal component in the product with one based on Alfresco, in case you were wondering - that would be too risky, at least at this stage.

According to this somewhat controversial Infoworld article, Alfresco scores better than Drupal. However, the jury seems to be still out on that.

I think the best way to decide is to figure this out for myself. Prior to working with Drupal, I didn't really know what to look for in a CMS. Not that I know everything there is to know about this even now, but here is my set of "required features" for a CMS.

  • Custom Content - the ability to define custom content types specific to the application.
  • Profile - the ability to store user profile information, which may not be natively supported in the CMS user object. The reason I mention this separately is that the user object is usually distinct from a content object.
  • Import Content and Users - there should be some sort of API so I can import content and users from an external (possibly XML) source.
  • Users and Roles - CMS should support multiple users with different roles.
  • Workflow - documents will have to pass through multiple reviews before being published.
  • Relate content - a document in the CMS may be associated with zero or more other documents in the CMS.
  • Taxonomy - a document may be associated with multiple taxonomy vocabularies. The associations could be 1:1 or 1:n.
  • Enter/Maintain content - there should be a UI in order for users to enter new content and maintain existing content.
  • Interface to Publisher - should be able to send publish/unpublish commands to the current publisher interface.

With Drupal, we had the benefit of a consultant who helped us out with the installation, setup and initial learning curve. So I may be a bit biased towards Drupal because to me it is "simpler". However, with Alfresco, I am more familiar with the components used to build it (Spring, Lucene, Hibernate, JCR), so perhaps the bias will cancel itself out.

Since I am not an Alfresco expert, I plan on spending the next several weeks working through the various "requirements" and seeing how hard/easy each is compared to Drupal (much of this stuff is already implemented in our Drupal instance by our consultant, and some by me). At the end of this, I hope to have enough knowledge to be able to customize an Alfresco instance to a set of semi-realistic base requirements.

This week, all I've been able to do is to set up the Alfresco ECM client, basically a web application running on Tomcat. I've also set up an Eclipse project that will contain my customizations to the base Alfresco package to make it behave more like our Drupal installation. I describe them below:

Alfresco ECM Client setup

Prebuilt packages for Windows, Linux and MacOS exist for doing 1-click installs of Alfresco. However, I wanted to use the Tomcat that ships with my Mac (which I ended up upgrading later for a different reason) and the MySQL that I downloaded as part of MAMP for Drupal earlier - and since the prebuilt packages embed both these components, I didn't want to use them.

So I initially downloaded the project from SVN, but was unable to build it, so I downloaded the latest stable WARs and popped the alfresco.war file into Tomcat's webapps folder. It complained about various things:

  • Tomcat running out of PermGen - the fix was to set a higher value for PermGen space based on this Alfresco wiki page.
  • Missing ImageMagick and swftools - Alfresco complains about not finding these, so I needed to install them (sudo port install) and update the repository.xml file to point to the correct locations for these two tools.

I also had to create the alfresco database in MySQL and grant the alfresco user the appropriate rights as defined here.

At the end of a fairly long startup (the database gets populated with the tables the first time round, and the alf_data directory gets created and initialized), I was rewarded with the following page at http://localhost:8080/alfresco.


My first impression of the Alfresco ECM user interface is that it's horribly complex compared to Drupal's, but then it could just be my unfamiliarity with it.

Customization Project setup (Eclipse)

I followed Jeff Potts's Alfresco Developer's Guide (see References below) almost to the letter while setting up my client customizations project, so that I could use the build.xml file from the book's code download. The project is set up to zip up the files and unpack them into an exploded Alfresco web application in Tomcat.

The directory structure for the project is as follows:

 PROJECT_ROOT
  |
  +-- src
  |   |
  |   +-- java
  |   |
  |   +-- web
  |       |
  |       +-- META-INF
  |       |
  |       +-- jsp
  |       |   |
  |       |   +-- extension
  |       |
  |       +-- mycompany
  |
  +-- config
       |
       +-- alfresco
            |
            +--- extension

I also downloaded the Alfresco JAR files and created a lib directory outside the project, so they don't get copied along with the project's ZIP file.

Initial Thoughts

Drupal appears to be more "complete" and intuitive (at least to my web developer intuitions). You can configure it to add your customizations, use its forms interface to generate content, and even use it to power your site's dynamic content pages, all in the same application. From my initial skimming of Munwar Sharif's book (see References below), the ECM can do most of what I want from it - I just don't know how to do it all yet. However, the general recommendation is to have a separate custom application for the CMS users, communicating with Alfresco's repository over REST/SOAP. For my web users, I would want the application to be decoupled from the ECM anyway, so the absence of a web front-end in Alfresco is a non-issue for me.

Drupal also has a lot more documentation freely available on the Internet. This is probably just because Alfresco is a younger project and relatively harder to get into, so there are fewer people writing about it. However, there are at least two excellent books about it (see References below), which I suspect I will get much more familiar with over the coming weeks :-).

References

  • Alfresco Enterprise Content Management Implementation - by Munwar Sharif. I've just started reading this, so I don't have much to say at this point.
  • Alfresco Developer's Guide - by Jeff Potts. I've gone through this once already. There is an enormous amount of information in here, which I haven't digested completely either. Hopefully, as I work through my use cases, I will understand more.

Useful Links

Here are some links I found useful; hopefully you find them useful too.

Update - 2010-06-04

I wanted to have a way to override repository.xml using my custom alfresco-global.properties, so I followed the instructions here and here, but had no luck. I ended up adding these properties into the exploded repository.properties file instead. Not clean, but it works. It's probably as much work to maintain a custom version of alfresco.war as it is to maintain a Tomcat version customized for Alfresco.

Wednesday, May 05, 2010

Importing Nodes into Drupal using Java and XMLRPC

Regular readers may imagine that I am making up for lost time with this mid-week post. Actually, it's just that I have a prior engagement over the weekend which will probably prevent me from posting then - and besides, I am done with this stuff, so hopefully it helps someone that much sooner :-).

Background

The idea is that given a machine readable set of post data, perhaps an RSS/Atom feed or a dump from some other CMS, we should be able to import the posts into a local Drupal installation. I had toyed with using the Feeds module earlier, but while it works beautifully (with the PHP memory_limit upped to 64M) with an Atom feed from my blog, making it work with a non-standard XML feed would require custom subclasses of one or more of the FeedFetcher, FeedParser or FeedProcessor classes. Check out these screencasts if you are interested in setting up your import this way. A more general solution would be to point Drupal at a Java proxy web application which converts incoming custom formats into some sort of "common" uber-format, then have custom subclasses of the Feed components on the Drupal end (via a custom module) that parse and process incoming nodes using a set of shared conventions. However, a colleague suggested using Drupal's XMLRPC service, which turns out to be much simpler and probably just as effective, so that's what I ended up doing.

For a proof of concept, I decided to use as input the RSS feed from my blog, containing the 25 most recent articles, and see if I could import them into Drupal over XMLRPC using a Java client. Along with the title and body, I also decided to import the original URL and pubDate fields into custom CCK fields, and the category tags into a custom taxonomy. Here is what I had to do.

Create a custom type in Drupal

If you are aggregating different types of feeds into Drupal, then it is likely that each feed will have some fields that are unique to that type of feed. So for each feed, I envision having to create a custom content type. Here is the content type (blogger_story) I created in Drupal for my blogger feed.

As you can see, it's basically the Story type, with two CCK fields, field_origurl and field_pubdate, as well as a Taxonomy field (after Title). The Taxonomy field is called Blogger_Tags, and is attached to the BloggerStory type, as shown below:

Loosen up some permissions

I allowed anonymous users to create the blogger_story content type in order to bypass the Drupal authentication. This may or may not be good for your setup. I do log in from the Java client, but my tests showed that passing in the resulting session id does not seem to make a difference - it kept giving me "Access Denied" until I allowed anonymous users to create content. It's possible that I am missing some parameter setting here, though.

Create a Java bean to hold the blog data

I created a simple JavaBean representing the BloggerStory content to hold the blog data as it is parsed off the XML, and used it to populate the XMLRPC parameters. Most of it is boilerplate, but I include it for completeness below.

// Source: src/main/java/com/mycompany/myapp/blogger/DrupalBloggerStory.java
package com.mycompany.myapp.blogger;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/**
 * Standard Drupal Story with two CCK fields, and a multi-select
 * taxonomy field.
 */
public class DrupalBloggerStory {

  public static final int BLOGGER_TAGS_VOCABULARY_ID = 1;
  public static final String TAGS_ATTR_NAME = "domain";
  public static final String TAGS_ATTR_VALUE = 
    "http://www.blogger.com/atom/ns#";

  private static final SimpleDateFormat PUBDATE_INPUT_FORMAT =
    new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z");
  private static final SimpleDateFormat PUBDATE_OUTPUT_FORMAT =
    new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
  
  private String title;
  private List<String> tags = new ArrayList<String>();
  private String body;
  private String originalUrl;
  private Date pubDate;
  
  public String getType() {
    return "blogger_story";
  }
  
  public String getTitle() {
    return title;
  }
  
  public void setTitle(String title) {
    this.title = title;
  }
  
  public List<String> getTags() {
    return tags;
  }
  
  public void setTags(List<String> tags) {
    this.tags = tags;
  }
  
  public String getBody() {
    return body;
  }
  
  public void setBody(String body) {
    this.body = body;
  }
  
  public String getOriginalUrl() {
    return originalUrl;
  }
  
  public void setOriginalUrl(String originalUrl) {
    this.originalUrl = originalUrl;
  }

  /** Convert to Drupal's date format */
  public String getPubDate() {
    return PUBDATE_OUTPUT_FORMAT.format(pubDate);
  }

  /** Convert from RSS feed pubDate */
  public void setPubDate(String pubDate) throws ParseException {
    this.pubDate = PUBDATE_INPUT_FORMAT.parse(pubDate);
  }
}
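As an example of the date conversion (assuming a machine in the Pacific timezone, matching the feed's -0700 offset), an RSS pubDate of "Wed, 05 May 2010 07:30:00 -0700" is parsed by setPubDate() and comes back out of getPubDate() as "2010-05-05T07:30:00", the format Drupal expects.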

Build the Feed Client

The feed client is a simple wrapper over an Apache XMLRPC client, exposing Java methods that look similar to the corresponding Drupal XMLRPC service methods. It also handles the details of populating CCK and Taxonomy (multi-select only) fields. Here is the code for the feed client.

// Source: src/main/java/com/mycompany/myapp/blogger/BloggerFeedClient.java
package com.mycompany.myapp.blogger;

import java.net.URL;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.xmlrpc.XmlRpcException;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;
import org.apache.xmlrpc.client.XmlRpcCommonsTransportFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.mycompany.myapp.xmlrpc.CustomXmlRpcCommonsTransportFactory;

/**
 * Client to connect to the Drupal XMLRPC service. Exposes
 * the required services as client side Java method calls.
 */
public class BloggerFeedClient {

  private final Logger logger = LoggerFactory.getLogger(getClass());
  
  private XmlRpcClient client;
  
  public BloggerFeedClient(String serviceUrl) throws Exception {
    XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
    config.setServerURL(new URL(serviceUrl));
    this.client = new XmlRpcClient();
//    client.setTransportFactory(new XmlRpcCommonsTransportFactory(client));
    // logging transport - see my previous post for details
    client.setTransportFactory(new CustomXmlRpcCommonsTransportFactory(client));
    config.setEnabledForExceptions(true);
    config.setEnabledForExtensions(true);
    client.setConfig(config);
  }
  
  @SuppressWarnings("unchecked")
  public String userLogin(String user, String pass) throws XmlRpcException {
    Map<String,Object> result = 
      (Map<String,Object>) client.execute("user.login", 
      new String[] {user, pass});
    return (result == null ? null : (String) result.get("sessid"));
  }
  
  public void taxonomySaveTerms(int vocabularyId, Collection<String> terms)
      throws XmlRpcException {
    for (String term : terms) {
      Map<String,Object> termObj = new HashMap<String,Object>();
      termObj.put("vid", vocabularyId);
      termObj.put("name", term);
      int status = (Integer) client.execute(
        "taxonomy.saveTerm", new Object[] {termObj});
      logger.info("Added term:[" + term + "] " + 
        (status == 0 ? "Ok" : "Failed"));
    }
  }
  
  /**
   * "Implementation" (in the Drupal sense) of the node.save XMLRPC
   * method for DrupalBloggerStory.
   */
  public String bloggerStorySave(DrupalBloggerStory story, 
      Map<String,Integer> termTidMap) throws XmlRpcException {
    Map<String,Object> storyObj = new HashMap<String,Object>();
    storyObj.put("type", story.getType()); // mandatory
    storyObj.put("title", story.getTitle());
    storyObj.put("body", story.getBody());
    storyObj.put("field_origurl", mkCck(story.getOriginalUrl()));
    storyObj.put("field_pubdate", mkCck(story.getPubDate()));
    storyObj.put("uid", 1); // admin
    Map<String,List<String>> tags = new HashMap<String,List<String>>();
    tags.put(String.valueOf(
      DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID), story.getTags());
    storyObj.put("taxonomy", mkTaxonomy(termTidMap, tags));
    String nodeId = (String) client.execute(
      "node.save", new Object[] {storyObj});
    return nodeId;
  }

  /**
   * CCK fields are stored as field_${field_name}[0]['value']
   * in the node object, so that's what we build here for the 
   * XMLRPC payload. I have seen other formats too, so check in
   * the page source for the node edit form.
   */
  @SuppressWarnings("unchecked")
  private List<Map<String,String>> mkCck(String value) {
    Map<String,String> struct = new HashMap<String,String>();
    struct.put("value", value);
    return Arrays.asList(new Map[] {struct});
  }

  /**
   * During editing forms, multi-select taxonomy entries are
   * stored as:
   * node->taxonomy["$vid"][$tid1, $tid2, ...]
   * Tag fields are stored differently:
   * node->taxonomy["tags"]["$vid"][$tid1, $tid2, ...] 
   * The entire thing is stored differently on node_load(), ie
   * when loading a node from the db.
   * node->taxonomy[$tid][stdClass::term_data]
   * We just handle the multi-select taxonomy field case here.
   */
  private Map<String,List<Integer>> mkTaxonomy(
      Map<String,Integer> termTidMap, 
      Map<String,List<String>> tags) {
    Map<String,List<Integer>> taxonomyValue = 
      new HashMap<String,List<Integer>>();
    for (String vid : tags.keySet()) {
      List<Integer> tids = new ArrayList<Integer>();
      for (String tag : tags.get(vid)) {
        tids.add(termTidMap.get(tag));
      }
      taxonomyValue.put(vid, tids);
    }
    return taxonomyValue;
  }
}
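A quick note on the payload shape mkTaxonomy() produces: if, say, "lucene" and "solr" had been stored as tids 12 and 15 under vocabulary 1 (hypothetical tids, for illustration), the node's taxonomy member would go out as the struct {"1": [12, 15]} - keyed by vocabulary id, with the list of term ids as the value.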

First pass: import vocabulary terms

To load terms, we use the Drupal XMLRPC service taxonomy.saveTerm. I was doing this inline with the post import initially, but noticed that the service does not skip inserting terms which have already been added to the term_data (and term_hierarchy) tables. So I decided to do a first pass to extract all the category tags from the posts, sort them alphabetically, remove duplicates, and then shove them into Drupal. Here's the taxonomy import code.

// Source: src/main/java/com/mycompany/myapp/blogger/BloggerFeedCategoryTaxonomyImporter.java
package com.mycompany.myapp.blogger;

import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Parses the XML file and extracts a non-duplicate list of
 * categories. We have to separate this out into its own process
 * for two reasons:
 * 1) the Drupal taxonomy.saveTerm does not seem to be smart
 *    enough to recognize duplicate terms.
 * 2) the taxonomy.saveTerm does not return a tid value, instead
 *    it returns a 0 or 1 signifying success or failure, so
 *    it's not possible to get the tid value for the term 
 *    inserted. We have to do a separate database call into
 *    the Drupal database to get this mapping when importing
 *    nodes.
 */
public class BloggerFeedCategoryTaxonomyImporter {

  private final Logger logger = LoggerFactory.getLogger(getClass());
  
  private BloggerFeedClient bloggerFeedClient;
  
  public BloggerFeedCategoryTaxonomyImporter(String serviceUrl, 
      String drupalUser, String drupalPass) throws Exception {
    bloggerFeedClient = new BloggerFeedClient(serviceUrl);
    bloggerFeedClient.userLogin(drupalUser, drupalPass);
  }
  
  public void importTerms(String inputFile) throws Exception {
    Set<String> terms = parseTerms(inputFile);
    List<String> termsAsList = new ArrayList<String>(terms);
    Collections.sort(termsAsList); // alphabetically
    logger.debug("Inserting terms: " + termsAsList);
    bloggerFeedClient.taxonomySaveTerms(
      DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID, termsAsList);
  }
  
  private Set<String> parseTerms(String inputFile) throws Exception {
    Set<String> terms = new HashSet<String>();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(
      new FileInputStream(new File(inputFile)));
    boolean inItem = false;
    for (;;) {
      int evt = parser.next();
      if (evt == XMLStreamConstants.END_DOCUMENT) {
        break;
      }
      switch (evt) {
        case XMLStreamConstants.START_ELEMENT: {
          String tag = parser.getName().getLocalPart();
          if ("item".equals(tag)) {
            inItem = true;
          }
          if (inItem) {
            if ("category".equals(tag)) {
              int nAttrs = parser.getAttributeCount();
              for (int i = 0; i < nAttrs; i++) {
                String attrName = parser.getAttributeName(i).getLocalPart();
                String attrValue = parser.getAttributeValue(i);
                if (DrupalBloggerStory.TAGS_ATTR_NAME.equals(attrName) &&
                    DrupalBloggerStory.TAGS_ATTR_VALUE.equals(attrValue)) {
                  terms.add(parser.getElementText());
                }
              }
            }
          }
          break;
        }
        case XMLStreamConstants.END_ELEMENT: {
          String tag = parser.getName().getLocalPart();          
          if ("item".equals(tag) && inItem) {
            inItem = false;
          }
          break;
        }
        default: 
          break;
      }
    }
    parser.close();
    return terms;
  }
}

We run the importer using the JUnit snippet below. This results in 41 categories being written to the term_data and term_hierarchy tables in the Drupal database.

  @Test
  public void testImportTaxonomyTerms() throws Exception {
    BloggerFeedCategoryTaxonomyImporter importer =
      new BloggerFeedCategoryTaxonomyImporter(
      DRUPAL_XMLRPC_SERVICE_URL, DRUPAL_IMPORT_USER, DRUPAL_IMPORT_PASS);
    importer.importTerms(IMPORT_XML_FILENAME);
  }

Second pass: import the posts

We now create a slightly different parser to extract the various things we need to populate the BloggerStory data from the RSS XML file for my blog. Note that I could have just used the ROME FeedFetcher to parse the RSS into a SyndFeed, but the objective here is to be able to parse and load any XML feed, so I built my own parser. Here it is:

// Source: src/main/java/com/mycompany/myapp/blogger/BloggerFeedImporter.java
package com.mycompany.myapp.blogger;

import java.io.File;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Parses a blog feed into a list of story beans and imports them into
 * Drupal.
 */
public class BloggerFeedImporter {

  private final Logger logger = LoggerFactory.getLogger(getClass());
  
  private Map<String,Integer> bloggerTagsMap;
  private BloggerFeedClient bloggerFeedClient;
  
  public BloggerFeedImporter(
      String serviceUrl, String drupalUser, String drupalPass, 
      String dbUrl, String dbUser, String dbPass) throws Exception {
    bloggerTagsMap = loadTaxonomy(
      DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID,
      dbUrl, dbUser, dbPass);
    logger.debug("bloggerTagsMap=" + bloggerTagsMap);
    bloggerFeedClient = new BloggerFeedClient(serviceUrl);
    bloggerFeedClient.userLogin(drupalUser, drupalPass);
  }

  public void importBlogs(String inputFile) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(
      new FileInputStream(new File(inputFile)));
    DrupalBloggerStory story = null;
    boolean inItem = false;
    for (;;) {
      int evt = parser.next();
      if (evt == XMLStreamConstants.END_DOCUMENT) {
        break;
      }
      switch (evt) {
        case XMLStreamConstants.START_ELEMENT: {
          String tag = parser.getName().getLocalPart();
          if ("item".equals(tag)) {
            story = new DrupalBloggerStory();
            inItem = true;
          }
          if (inItem) {
            if ("pubDate".equals(tag)) {
              story.setPubDate(parser.getElementText());
            } else if ("category".equals(tag)) {
              int nAttrs = parser.getAttributeCount();
              for (int i = 0; i < nAttrs; i++) {
                String attrName = parser.getAttributeName(i).getLocalPart();
                String attrValue = parser.getAttributeValue(i);
                if (DrupalBloggerStory.TAGS_ATTR_NAME.equals(attrName) &&
                    DrupalBloggerStory.TAGS_ATTR_VALUE.equals(attrValue)) {
                  story.getTags().add(parser.getElementText());
                }
              }
            } else if ("title".equals(tag)) {
              story.setTitle(parser.getElementText());
            } else if ("description".equals(tag)) {
              story.setBody(parser.getElementText());
            } else if ("link".equals(tag)) {
              story.setOriginalUrl(parser.getElementText());
            }
          }
          break;
        }
        case XMLStreamConstants.END_ELEMENT: {
          String tag = parser.getName().getLocalPart();          
          if ("item".equals(tag) && inItem) {
            String nodeId = bloggerFeedClient.bloggerStorySave(
              story, bloggerTagsMap);
            logger.info("Saving blogger_story:[" + nodeId + "]: " + 
              story.getTitle());
            inItem = false;
          }
          break;
        }
        default: 
          break;
      }
    }
    parser.close();
  }

  private Map<String,Integer> loadTaxonomy(int vocabularyId,
      String dbUrl, String dbUser, String dbPass) throws Exception {
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection conn = DriverManager.getConnection(
      dbUrl, dbUser, dbPass);
    Map<String,Integer> termTidMap = new HashMap<String,Integer>();
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      ps = conn.prepareStatement(
        "select name, tid from term_data where vid = ?");
      ps.setInt(1, DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID);
      rs = ps.executeQuery();
      while (rs.next()) {
        termTidMap.put(rs.getString(1), rs.getInt(2));
      }
      return termTidMap;
    } finally {
      if (rs != null) { try { rs.close(); } catch (SQLException e) {}}
      if (ps != null) { try { ps.close(); } catch (SQLException e) {}}
      // close the connection along with the statement and result set
      if (conn != null) { try { conn.close(); } catch (SQLException e) {}}
    }
  }
}

Notice that we loaded up a map of term to termId (tid) from the term_data table. This is because we only have the term values from our parsed content, but we need to populate the node with the map of vocabulary_id to the list of term_ids (not terms). Running this code using the JUnit snippet below:

  @Test
  public void testImportBloggerStories() throws Exception {
    BloggerFeedImporter importer = 
      new BloggerFeedImporter(DRUPAL_XMLRPC_SERVICE_URL, 
      DRUPAL_IMPORT_USER, DRUPAL_IMPORT_PASS, 
      DB_URL, DB_USER, DB_PASS);
    importer.importBlogs(IMPORT_XML_FILENAME);
  }

...results in the blogs in the feed being imported into Drupal. Here is a screenshot of the preview page for one of the blog posts (the original is here). As you can see, all the fields and taxonomy entries came through correctly.

Using Drupal's XMLRPC interface to import data appears to be a fairly popular approach, going by the number of example clients in Python, PHP and C# available on the web. I haven't seen one in Java though, so hopefully this post fills that gap. Note that this example may not be enough for you - you may need to import users, for example, or access image or node reference fields, which are stored in a slightly different structure than the CCK fields - for those, you should look for hints in the edit form. But hopefully, this is a useful starting point.