Wednesday, November 14, 2007

 

Using Hibernate Search with complex requirements

Authors:
Roberto Bicchierai (e-mail: rbicchierai@open-lab.com)
Pietro Polsinelli (e-mail: ppolsinelli@open-lab.com)

Introduction

We’ll suppose here that you have a Java web application persisted with Hibernate, and that you want to enable full-text searches.

Your main aim is to have a single search page that will find content wherever it lives in your application, whether in persisted object fields, "long" text fields (CLOBs), attached documents (BLOBs), or uploaded files (say, on the file system).

Adding Hibernate Search to your toolset will probably meet most of your requirements directly, but for more refined requirements and greater flexibility you may need to add some functionality; here you will find some inspiration and tricks for building your solution.

In the following we assume some familiarity with basic indexing and Lucene concepts, such as can be gained from the excellent “Lucene in Action” book (see References).

Problem set

You start with (hopefully a subset of) this problem set:

1. Simplicity. You want to keep your code maintainable: hence you want to keep persistence and indexing configuration "in the same place".

2. Lob fields. You have fields on objects which may be of CLOB/BLOB type, or which link to files on the file system.

3. Multi documents. You have objects of type "document", which may themselves have attachments whose lifecycle is contained in that of the main objects; you want to search on the attachments, but find the main objects as referrals: e.g. tasks/documents.

4. Languages. You have to index data in different languages. This is not a remote case; for example, in all countries of the European Union (excluding the UK and Ireland), you will have at least data and documents in the country’s own language and in English.

5. Document formats. You have documents in various formats: PDF, Word, Excel, PowerPoint, HTML, RTF, plain text, zipped archives.

6. Paging. You want to present paginated results.

7. Security. As in all cases of full-text search, you may have security concerns.

Solutions

In considering these solutions, keep in mind that Hibernate Search is a very flexible framework: so the solutions we propose here are just one way to do it among many.

1. Simplicity. Keeping persistence and indexing configuration "in the same place", so that the code stays maintainable, is practically a description of Hibernate Search itself.

One of the nice effects of using Hibernate Search is that when you delete a persistent object, the Lucene index is updated as well, and all documents referring to the entity are removed.
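
For instance, a minimal sketch of an indexed entity with Hibernate Search annotations might look like the following (the Task entity and its fields are illustrative, not taken from Teamwork):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Store;

@Entity
@Indexed
public class Task {

    @Id
    @DocumentId  // used by Hibernate Search as the Lucene document id
    private Long id;

    // tokenized for full-text search, not stored in the index
    @Field(index = Index.TOKENIZED, store = Store.NO)
    private String name;

    @Field(index = Index.TOKENIZED, store = Store.NO)
    private String description;

    // getters and setters omitted
}

Both the Hibernate mapping and the indexing configuration live on the same class, which is exactly the "same place" requirement above.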

2. Lob fields. A solution perfectly in line with Hibernate Search is provided by our wiki contribution here:
http://www.hibernate.org/432.html

3. Multi documents. Here we introduce an original solution: a very simple scheduled job which handles a queue of documents, modeled by the class DataForLucene, to be indexed asynchronously. This approach gives you more flexibility, allowing you to manage freely the number of Lucene documents associated with the same entity, and how and what gets into the index.

Here is the DataForLucene class:

/**
 * Written by
 * Roberto Bicchierai rbicchierai@open-lab.com
 * Pietro Polsinelli ppolsinelli@open-lab.com
 * for the Teamwork Project Management application - http://www.twproject.com
 */
public class DataForLucene implements Comparable {

    public Serializable id;        // id of the referring persistent entity
    public Class clazz;            // class of the referring persistent entity
    public Serializable areaid;    // account ("area") the entity belongs to
    public PersistentFile pf;      // the external content to be indexed
    public long expiry;

    public void indexMe() {

        PersistenceContext pc = null;
        IndexWriter w = null;
        try {
            // open a session independent of the current request's one
            pc = HibernateFactory.newFreeSession();

            FullTextSession fullTextSession = Search.createFullTextSession(pc.session);
            SearchFactory searchFactory = fullTextSession.getSearchFactory();

            // resolve the real class behind a possible Hibernate proxy, and
            // write into the same index used by Hibernate Search for that class
            clazz = (Class) Class.forName(PersistenceHome.deProxy(clazz.getName()));
            DirectoryProvider[] provider = searchFactory.getDirectoryProviders(clazz);
            org.apache.lucene.store.Directory directory = provider[0].getDirectory();

            // extract the text and guess its language, so as to stem correctly
            String content = TextExtractor.getContent(pf, pc);
            String guessedLanguage = IndexingBricks.guess(content);

            w = new IndexWriter(directory, true,
                    new SnowballAnalyzer(IndexingBricks.stemmerFromLanguage(guessedLanguage)));

            String abstractOfContent = JSP.limWr(content, 5000);

            Document doc = new Document();

            // the class and id fields mirror those written by Hibernate Search,
            // so these documents live and die together with the entity's own
            doc.add(new Field(DocumentBuilder.CLASS_FIELDNAME, clazz.getName(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("id", id.toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));

            // the extracted text, searchable but not stored
            doc.add(new Field("content", content,
                    Field.Store.NO, Field.Index.TOKENIZED));

            // a compressed abstract, stored for presenting results
            doc.add(new Field("abstract", abstractOfContent,
                    Field.Store.COMPRESS, Field.Index.UN_TOKENIZED));

            // a second copy of the content; see point 4 (Languages) below
            doc.add(new Field("fullcontent", content,
                    Field.Store.NO, Field.Index.TOKENIZED));

            doc.add(new Field("persistentFile", pf.serialize(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));

            // the account ("area"), used for security filtering; see point 7
            doc.add(new Field("area.id", "" + areaid,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));

            doc.add(new Field("language", guessedLanguage,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));

            w.addDocument(doc);

        } catch (Throwable throwable) {
            Tracer.platformLogger.error(throwable);
        } finally {
            if (w != null)
                try {
                    w.close();
                } catch (IOException e) {
                    Tracer.platformLogger.error(e);
                }
            if (pc != null)
                try {
                    pc.commitAndClose();
                } catch (PersistenceException e) {
                    Tracer.platformLogger.error(e);
                }
        }
    }

    public boolean equals(Object o) {
        return this.compareTo(o) == 0;
    }

    public int hashCode() {
        return (clazz.getName() + id).hashCode();
    }

    public int compareTo(Object o) {
        DataForLucene forLucene = (DataForLucene) o;
        return (clazz.getName() + id + pf.serialize())
                .compareTo(forLucene.clazz.getName() + forLucene.id + forLucene.pf.serialize());
    }

}


Here is the simplest possible indexing machine implementation:

/**
 * Written by
 * Roberto Bicchierai rbicchierai@open-lab.com
 * Pietro Polsinelli ppolsinelli@open-lab.com
 * for the Teamwork Project Management application - http://www.twproject.com
 */
public class IndexingMachine extends TimerTask {

    public static IndexingMachine machine = new IndexingMachine();
    public long tick = 10000;           // polling interval when the queue is empty
    private boolean stopped = true;
    private boolean indexing = false;

    private static List<DataForLucene> toBeExecuteds = new ArrayList<DataForLucene>();

    private IndexingMachine() {
    }

    public static void start() {
        machine.stopped = false;
        if (!machine.indexing) {
            machine.run();
        }
    }

    public static void stop() {
        machine.stopped = true;
    }

    public void run() {

        // take the head of the queue, if any, and index it
        DataForLucene ij = null;
        synchronized (toBeExecuteds) {
            if (toBeExecuteds.size() > 0)
                ij = toBeExecuteds.remove(0);
        }
        if (ij != null) {
            indexing = true;
            ij.indexMe();
            indexing = false;
        }

        // hurry up if there is still work queued, otherwise poll lazily
        if (toBeExecuteds.size() > 0)
            tick = 20;
        else
            tick = 10000;

        // a TimerTask runs once: reschedule a fresh instance for the next pass
        if (!machine.stopped && !machine.indexing) {
            Timer t = new Timer(false);
            machine = new IndexingMachine();
            machine.stopped = false;
            t.schedule(machine, tick);
        }
    }

    public static void addToBeIndexed(Identifiable i, Serializable areaId, PersistentFile pf) {
        DataForLucene ij = new DataForLucene();
        ij.clazz = i.getClass();
        ij.id = i.getId();
        ij.areaid = areaId;
        ij.pf = pf;
        synchronized (toBeExecuteds) {
            if (!toBeExecuteds.contains(ij))
                toBeExecuteds.add(ij);
        }
    }

    public static int getQueueSize() {
        return toBeExecuteds.size();
    }

    public static boolean isRunning() {
        return !machine.stopped;
    }

    public static boolean isIndexing() {
        return machine.indexing;
    }

}

Notice that in DataForLucene we must reference the persistent entity in a way compatible with Hibernate Search indexing, so that, for example, on entity deletion these documents get removed as well. More generally, you need a uniform index structure to be able to query it.
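
As a sketch of what this uniform structure buys you, here is how a single search over the shared index might look with plain Lucene 2.x (the analyzer choice and the entity loading are illustrative, not Teamwork code; the field names are those written by DataForLucene above):

// search the shared index directly, finding both Hibernate Search documents
// and those written by DataForLucene, since they share class and id fields
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse(userInput);
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    String className = doc.get(DocumentBuilder.CLASS_FIELDNAME);
    String entityId = doc.get("id");
    // load the referring entity via Hibernate from className and entityId
}
searcher.close();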

4. Languages. You need to know the language in which a document is written in order to index it correctly; once you know the language, you can instantiate, say, the Snowball analyzer with the correct language stemmer. To make a practical system, you will need to guess the document’s language from its content. We have found a very simple and effective solution based on TCatNG, which is released under a BSD license. You actually need only 10 classes (instead of the 117 in the sources), all directly in the package pt.tumba.ngram.

In order to make content "findable" also when searching from one language (say, German) for a document in another language (say, English), we actually double index the content, once with the Snowball analyzer and once with the simple StopAnalyzer; so if you search from German for "Telefunken", which stemmed would be searched as "Telefunk", you will also find "Telefunken" in English documents.
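
One way to realize this double indexing with plain Lucene is a PerFieldAnalyzerWrapper; a minimal sketch, assuming the "content" and "fullcontent" fields of DataForLucene above carry the stemmed and unstemmed copies respectively:

// stem the "content" field with Snowball; analyze "fullcontent" with the
// plain StopAnalyzer so that unstemmed terms remain searchable
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
        new SnowballAnalyzer(IndexingBricks.stemmerFromLanguage(guessedLanguage)));
analyzer.addAnalyzer("fullcontent", new StopAnalyzer());
IndexWriter w = new IndexWriter(directory, true, analyzer);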

5. Document formats. We provide a simple implementation of text extraction, inspired as always by a "keep it simple" philosophy, for which the Lius sources gave us some help. For a more mature (and complex) approach, check out the Nutch project.

private static String extractFromStream(String fileName, InputStream inputStream) throws Exception {

    StringWriter sw = new StringWriter();
    String content = "";

    if (fileName.endsWith(".pdf")) {
        // PDF: extract with PDFBox
        PDFTextStripper stripper = new PDFTextStripper();
        PDDocument document = PDDocument.load(inputStream);
        stripper.writeText(document, sw);
        content = content + sw.getBuffer().toString();
        document.close();

    } else if (fileName.endsWith(".doc")) {
        // Word documents
        WordExtractor we = new WordExtractor();
        content = we.extractText(inputStream);

    } else if (fileName.endsWith(".htm") || fileName.endsWith(".html")) {
        // HTML: parse to a DOM and collect the text nodes
        Node root = getDOMRoot(inputStream);
        content = getTextContentOfDOM(root);

    } else if (fileName.endsWith(".ppt")) {
        PPTIndexer reader = new PPTIndexer();
        content = reader.getContent(inputStream);

    } else if (fileName.endsWith(".xls")) {
        ExcelIndexer reader = new ExcelIndexer();
        content = reader.getContent(inputStream);

    } else if (fileName.endsWith(".rtf")) {
        // RTF: use Swing's RTFEditorKit
        DefaultStyledDocument sd = new DefaultStyledDocument();
        RTFEditorKit kit = new RTFEditorKit();
        kit.read(inputStream, sd, 0);
        content = sd.getText(0, sd.getLength());

    } else if (fileName.endsWith(".zip") || fileName.endsWith(".war") || fileName.endsWith(".jar")) {
        // archives: unpack and recurse on each contained file
        Set<File> files = Zipping.getZipContents(inputStream);
        for (File file : files) {
            FileInputStream fis = new FileInputStream(file);
            content = content + extractFromStream(file.getName(), fis);
            fis.close();
        }

    } else if (fileName.endsWith(".txt") || fileName.endsWith(".log")) {
        // plain text: just read it line by line
        StringBuffer sb = new StringBuffer();
        BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line);
            sb.append(" ");
        }
        content = sb.toString();
    }
    return content;
}

6. Paging. For this, you can use our paging solution out of the box, as Hibernate Search queries return results just like usual Hibernate queries: see

http://www.hibernate.org/314.html
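
With Hibernate Search itself, paging reduces to the usual Hibernate idiom, since a full-text query is an ordinary org.hibernate.Query. A minimal sketch (luceneQuery, Task, pageNumber and pageSize are illustrative):

FullTextSession fullTextSession = Search.createFullTextSession(session);
org.hibernate.Query query =
        fullTextSession.createFullTextQuery(luceneQuery, Task.class);
query.setFirstResult(pageNumber * pageSize);  // offset of the first result
query.setMaxResults(pageSize);                // results per page
List results = query.list();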

7. Security. For this, you may need to add data to the index: suppose you have an ASP-based service, like our Teamwork. As a first rough approximation, you could save the ASP account to which the document belongs: we did this by having the account (which we call "area") as an @IndexedEmbedded class, and doing the same in our DataForLucene.
Hibernate Search supports the notion of filters, but this seems a bit limited, as it acts at the Lucene document level.
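
A minimal sketch of the rough approximation described above, using the "area.id" field indexed by DataForLucene (userQuery and currentUserAreaId are illustrative): AND a term clause on the account into every search, so results are restricted to the areas the current user may see.

BooleanQuery secured = new BooleanQuery();
secured.add(userQuery, BooleanClause.Occur.MUST);
secured.add(new TermQuery(new Term("area.id", currentUserAreaId)),
        BooleanClause.Occur.MUST);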

References

Lucene:
http://lucene.apache.org

Hibernate Search:
http://www.hibernate.org/410.html

Language guessing API, TCatNG:
http://tcatng.sourceforge.net

Offline text extraction with Hibernate Search: our contribution on Hibernate’s Wiki:
http://www.hibernate.org/432.html

Paging with Hibernate: our contribution on Hibernate’s Wiki
http://www.hibernate.org/314.html

Lius search engine:
http://sourceforge.net/projects/lius

The Nutch project:
http://lucene.apache.org/nutch

“Lucene in Action”, Otis Gospodnetic and Erik Hatcher, Manning, 2004
http://www.manning.com/hatcher2

Sample classes are taken from Teamwork:
http://www.twproject.com



Tuesday, November 06, 2007

 

Our contribution to the Hibernate Search project:

Hibernate Search: Offline text extraction

Suppose that, using Hibernate Search, you want to index not only the standard persistent content of your objects, such as string fields like name and description, but also external references to files, such as PDF documents, HTML contents and so on.

We are going to address the following problem: if you use Hibernate Search in the simplest way to index such properties, text extraction happens at the same time as the storing of the objects, and hence in a transactional scope, blocking thread completion until text extraction is done, even if indexing itself is asynchronous (which is an option in Hibernate Search).
Full-text search is one of many exciting new features that we will be releasing in the next few months.
