A Leap in Automated Document Classification Technologies Proves to Tame the Document Review Beast – But Here’s What You Need to Know
“Watson,” a computer using machine learning and other techniques, recently made international news when it dominated Jeopardy legends for three consecutive nights, prompting IBM’s general counsel to suggest that machine learning techniques might prove useful in finding information important to litigation.
The application of linguistic analysis technologies and machine learning to improve the way large volumes of information are assessed has been around for more than a decade. For example, probabilistic latent semantic analysis (PLSA), a text-based modeling technique that predicts relevance based on a subset of human-assessed documents, has been commercially applied to patent searches, business intelligence tools, online advertising optimization, and even automated essay grading. And predictive coding, another machine learning technique, has been used to distinguish spam from non-spam emails.
Machine Learning in E-Discovery
Such techniques are increasingly being applied in e-discovery, as machine learning approaches help attorneys curb the growing cost of document review. Most auto-classification tools take a manual sampling and review process and automate the workflow so attorneys can get to important documents faster, with less junk to look at along the way.
For example, automated document classification technologies based on PLSA combine attorneys’ expertise with computerized review technology and technical expertise to prioritize, or “rank,” documents for review. Attorneys review a sample of the document collection, designating each document as relevant or not. Based on the results, the technology “learns” how to rank documents by their likelihood of being relevant. In this iterative process, the software progressively improves the accuracy and consistency of its scoring and then ranks the entire collection. Attorneys are presented with groups of prioritized documents, and can quickly access those most likely to be relevant and prioritize (or, in some cases, eliminate) review accordingly.
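The train-then-rank loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: it uses a simple naive Bayes log-odds score rather than PLSA, and the function names (`train`, `score`, `rank`) and the sample documents are hypothetical, not taken from any vendor's product.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train(labeled):
    """Build word counts from (text, is_relevant) pairs coded by attorneys."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, is_relevant in labeled:
        counts[is_relevant].update(tokenize(text))
        docs[is_relevant] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def score(model, text):
    """Return the log-odds that a document is relevant (higher = more likely)."""
    counts, docs, vocab = model
    # Prior log-odds, with add-one smoothing on the document counts.
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    total_rel = sum(counts[True].values())
    total_not = sum(counts[False].values())
    v = len(vocab)
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # words never seen in the reviewed sample carry no signal
        p_rel = (counts[True][tok] + 1) / (total_rel + v)
        p_not = (counts[False][tok] + 1) / (total_not + v)
        log_odds += math.log(p_rel / p_not)
    return log_odds


def rank(model, collection):
    """Order the collection from most to least likely relevant."""
    return sorted(collection, key=lambda doc: score(model, doc), reverse=True)


# Hypothetical attorney-reviewed sample and unreviewed collection.
labeled = [
    ("merger pricing agreement with competitor", True),
    ("pricing discussion before the merger", True),
    ("lunch menu for friday", False),
    ("office party friday lunch", False),
]
collection = [
    "friday lunch order",
    "competitor pricing memo",
    "merger agreement draft",
]

model = train(labeled)
ranked = rank(model, collection)
```

In an iterative workflow, each new batch of attorney decisions would be appended to `labeled` and `train` run again before re-ranking the documents that remain unreviewed.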
When machines do the heavy lifting, review is faster and cheaper, and accuracy and consistency are enhanced. A recent survey of legal professionals, conducted by Xerox Litigation Services and Acritas Research, found that 72 percent of those who use automated document classification solutions achieved significant time savings. Another 64 percent said it was more cost-effective than manual review, and over half cited improvements in review consistency, accuracy, and budget planning.
Auto-Classification Technologies Cannot Replace Humans
Given the advantages of machine learning techniques, a whole class of do-it-yourself, push-button auto-classification software, referred to by some as “predictive coding,” has cropped up over the past few years.
But is technology alone enough to achieve accurate, consistent, and defensible results?
It’s becoming widely understood by attorneys and the courts alike that not only must the proper expertise, audit trail, and measurement be present; the process itself must also be repeatable and yield consistent results. In fact, studies conducted by the Text REtrieval Conference (TREC) Legal Track show that document review processes that employ a combination of computer and human input are superior.
Beyond Do-It-Yourself Software: What to Look for in Automated Document Classification Technology
So when evaluating auto-classification technologies, what should legal teams look for? Below are key questions to consider:
- Are the software, process, and workflow transparent? Make sure you are working with a vendor who can explain, in detail, the process, technology, and expertise they bring to the project.
- Are the right legal and subject matter experts involved? It is important that senior attorneys or subject matter experts be involved throughout the process to review documents and help “train” the technology.
- Are the appropriate technical experts involved? Relying on software implementation by your IT staff, or a vendor that hosts the software for you, is not sufficient. Technical experts, including statisticians and linguists, can drive the technology and process, and ensure statistically sound sampling, measurement of output, and improved results by developing additional models. A good team will also design and assign attorney workflow.
- How does the technology “learn” and improve? The process must be iterative, and the software must be able to adapt to additional expert review insights to improve output.
- How does the technology identify discrepancies in attorney assessments? Look for a process that identifies ambiguities in how expert reviewers assessed a specific document. Technical experts should be able to route these documents back to the reviewers to confirm or alter judgments on the documents and improve QC.
- Is there a robust audit trail? Experts should provide detailed records of key project parameters and inputs, decisions, and results throughout the process that show consistency when the same inputs and procedures are adopted.
- How are results validated? Statistically valid measurements, such as precision and recall, should be generated at every stage of the process for ongoing QC and final QA.
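As a concrete illustration of the validation question above, precision and recall can be computed from a sample in which the software's calls are compared against expert attorney judgments. The sketch below is a minimal example; the function name and the sample data are hypothetical. Precision is the share of documents the software marked relevant that the experts agreed were relevant, and recall is the share of truly relevant documents that the software found.

```python
def precision_recall(predicted, actual):
    """Compare software calls (predicted) with expert judgments (actual).

    Both arguments are equal-length lists of booleans, True = relevant.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Guard against division by zero when nothing was predicted or nothing is relevant.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical QC sample of eight documents.
software_calls = [True, True, True, False, False, False, True, False]
expert_calls = [True, True, False, True, False, False, True, False]

precision, recall = precision_recall(software_calls, expert_calls)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample the software marked four documents relevant and the experts agreed on three of them, while one relevant document was missed, so both precision and recall are 0.75. In practice the sample would need to be drawn in a statistically sound way for these figures to generalize to the full collection.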
With rapid advancements in auto-classification technologies in e-discovery, the future of document review is clear. What’s becoming clearer is how and why appropriate human experts are critical to the process, ensuring more accurate, consistent, and defensible results.
Kakkonen, T., Myller, N., Sutinen, E., & Timonen, J. (2008). Comparison of Dimension Reduction Methods for Automated Essay Grading. Educational Technology & Society, 11(3), 275–288; www.machinelearning.ru/wiki/images/4/47/LexinVoron08roaieng.pdf; http://portal.acm.org/citation.cfm?id=1963602
www.sap.com/.../2009.../2009_06_Worldtour_CFO11_SAP_fr.pdf; http://cnx.org/content/m11142/latest/ ; http://orcatec.blogspot.com/2011/06/competitors-press-release-about.html
Acritas Research, November 2010.