Jun 02, 15
When using Lucene in Java or Scala, you may be tempted to skip the QueryParser and use the “DAO” (for lack of a better term) to construct queries from the query classes Lucene provides. Preferring such abstractions, when available, over compiling raw query strings is generally a best practice for a variety of reasons, foremost being security (implicit injection protection) and query syntax integrity.
However, you may experience perplexing, incorrect result sets with your Lucene query if the following circumstances are true:
- Your index is written with an analyzer other than the default StandardAnalyzer (e.g. EnglishAnalyzer or any of the plethora of others).
- Your query is a boolean query with n OR (aka SHOULD) clauses, where n ≥ 2.
- Your query requires that at least m of those clauses match, where m ≤ n.
Ordinary query, incorrect results
Here is a simple example of a query that exhibits the latter two circumstances above as built entirely with the DAO (code examples henceforth using Scala for brevity):
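import org.apache.lucene.index.Term
import org.apache.lucene.search.{BooleanClause, BooleanQuery, TermQuery}

// Boolean query with OR clauses
// (TermQuery looks up each term verbatim; no analyzer is applied to it)
val q = new BooleanQuery
q.add(new TermQuery(new Term("articleTitle", "thanks")), BooleanClause.Occur.SHOULD)
q.add(new TermQuery(new Term("articleTitle", "obama")), BooleanClause.Occur.SHOULD)
q.add(new TermQuery(new Term("articleTitle", "barack")), BooleanClause.Occur.SHOULD)
// Match at least 2 of the clauses
q.setMinimumNumberShouldMatch(2)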
This query in plain English means “find documents that contain at least 2 of the terms thanks, obama, and barack.”
Now imagine an index of documents as follows written with EnglishAnalyzer (or some other non-StandardAnalyzer):
1 president jails congress; thanks, obama!
2 obama thanks al qaeda for joining in the fight against isis
3 obama to produce presidential library consisting entirely of ebooks
Running the query on the above documents should yield 2 hits: documents #1 and #2. However, you will receive 0 results in Lucene 3.5.0 (and possibly other versions; I did not check).
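For anyone who wants to try this locally, here is a minimal reproduction sketch. It assumes Lucene 3.5.0 (core plus the analyzers module) on the classpath; the in-memory RAMDirectory, the object name Repro and its helper methods are my own illustrative choices, not part of the original codebase.

```scala
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.document.{Document, Field}
import org.apache.lucene.index.{IndexReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{BooleanClause, BooleanQuery, IndexSearcher, Query, TermQuery}
import org.apache.lucene.store.{Directory, RAMDirectory}
import org.apache.lucene.util.Version

// Hypothetical helper object, for illustration only.
object Repro {
  val FieldName = "articleTitle"

  val titles = Seq(
    "president jails congress; thanks, obama!",
    "obama thanks al qaeda for joining in the fight against isis",
    "obama to produce presidential library consisting entirely of ebooks")

  // Write the three sample documents with EnglishAnalyzer into an in-memory index.
  def buildIndex(): Directory = {
    val dir = new RAMDirectory
    val config = new IndexWriterConfig(Version.LUCENE_35, new EnglishAnalyzer(Version.LUCENE_35))
    val writer = new IndexWriter(dir, config)
    try {
      titles.foreach { title =>
        val doc = new Document
        doc.add(new Field(FieldName, title, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)
      }
    } finally writer.close()
    dir
  }

  // The query from the post: three SHOULD clauses, at least 2 must match.
  def termOnlyQuery(): BooleanQuery = {
    val q = new BooleanQuery
    Seq("thanks", "obama", "barack").foreach { word =>
      q.add(new TermQuery(new Term(FieldName, word)), BooleanClause.Occur.SHOULD)
    }
    q.setMinimumNumberShouldMatch(2)
    q
  }

  // Run a query and return the total number of matching documents.
  def countHits(dir: Directory, q: Query): Int = {
    val reader = IndexReader.open(dir)
    val searcher = new IndexSearcher(reader)
    try searcher.search(q, 10).totalHits
    finally { searcher.close(); reader.close() }
  }

  def main(args: Array[String]): Unit =
    println("hits: " + countHits(buildIndex(), termOnlyQuery()))
}
```

Swapping EnglishAnalyzer for StandardAnalyzer in buildIndex is a quick way to check whether the analyzer really is the deciding factor.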
Unfortunately for me, I was stuck with Lucene 3.5.0 in this particular codebase. Luckily I found a way to sidestep the bug by avoiding the DAO for at least part of the query construction.
Same query, but without DAO (and working now!)
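val analyzer = new EnglishAnalyzer(Version.LUCENE_35) // or the same analyzer used to write the index
val qp = new QueryParser(Version.LUCENE_35, "articleTitle", analyzer)
val q = new BooleanQuery
q.add(qp.parse("thanks"), BooleanClause.Occur.SHOULD)
q.add(qp.parse("obama"), BooleanClause.Occur.SHOULD)
q.add(qp.parse("barack"), BooleanClause.Occur.SHOULD)
// Match at least 2 of the terms
q.setMinimumNumberShouldMatch(2)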
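Surprise, surprise, this works! Documents #1 and #2 from before will match as expected. The likely reason is that QueryParser runs each term through the analyzer, so thanks is reduced to the same stemmed form that EnglishAnalyzer wrote to the index, whereas TermQuery looks up its term verbatim.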
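One way to investigate why the parsed version behaves differently is to look at the tokens the analyzer actually emits for each query word and compare them with the raw strings handed to TermQuery. The sketch below assumes Lucene 3.5.0; AnalyzerInspect and tokens are hypothetical names introduced here for illustration.

```scala
import java.io.StringReader
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import scala.collection.mutable.ListBuffer

// Hypothetical helper object, for illustration only.
object AnalyzerInspect {
  // Return the tokens the given analyzer produces for `text` in `field`.
  def tokens(analyzer: Analyzer, field: String, text: String): List[String] = {
    val stream = analyzer.tokenStream(field, new StringReader(text))
    val termAttr = stream.addAttribute(classOf[CharTermAttribute])
    val out = ListBuffer.empty[String]
    try {
      stream.reset()
      while (stream.incrementToken()) out += termAttr.toString
      stream.end()
    } finally stream.close()
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val analyzer = new EnglishAnalyzer(Version.LUCENE_35)
    // Print each query word next to whatever the analyzer turns it into.
    Seq("thanks", "obama", "barack").foreach { word =>
      println(word + " -> " + tokens(analyzer, "articleTitle", word).mkString(", "))
    }
  }
}
```

If a printed token differs from the word you passed to TermQuery, that clause cannot match anything in the index, which may be enough to drop the query below its minimum-match threshold.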
Note on protecting against query injection
If you must use QueryParser.parse as in the case above, you should also make it a habit to use QueryParser.escape (a static method) on the string you pass to the parse method (e.g. myQueryParser.parse(QueryParser.escape("potentially dangerous user input"))). The reasons are beyond the scope of this post; just Google “query injection” and pick one of the endless writings on that.
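Putting the two pieces together, a small helper along these lines could build the "at least m of n" query from untrusted input. This is a sketch under the assumption of Lucene 3.5.0; SafeQuery, atLeast and their parameters are hypothetical names, and error handling (for example the ParseException that parse may throw on empty input) is left out.

```scala
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.{BooleanClause, BooleanQuery}
import org.apache.lucene.util.Version

// Hypothetical helper object, for illustration only.
object SafeQuery {
  // Build a BooleanQuery with one SHOULD clause per user-supplied string,
  // requiring at least `minMatch` of the clauses to match.
  def atLeast(field: String, analyzer: Analyzer, userTerms: Seq[String], minMatch: Int): BooleanQuery = {
    val qp = new QueryParser(Version.LUCENE_35, field, analyzer)
    val q = new BooleanQuery
    userTerms.foreach { raw =>
      // escape() backslash-escapes query syntax characters such as ':' and '*',
      // so the input is parsed as plain text rather than as query operators.
      q.add(qp.parse(QueryParser.escape(raw)), BooleanClause.Occur.SHOULD)
    }
    q.setMinimumNumberShouldMatch(minMatch)
    q
  }
}
```

Called as SafeQuery.atLeast("articleTitle", analyzer, Seq("thanks", "obama", "barack"), 2), this should produce the same kind of query as the working example above, with the analyzer being whichever one was used to write the index.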