How you can benefit from Open Source Natural Language Processing
Authored by: Phil Rhodes, Senior Consultant at Open Software Integrators
Natural Language Processing (NLP) has a long history, dating back to the earliest days of computer science. Computer scientists have been working on means to allow machines to “understand” written language for decades now. And while the field suffered setbacks in the 1960s and again during the “AI Winters” of the 1970s and late 1980s, significant progress has been made, and freely available, open-source toolkits now make it possible for almost any organization to begin taking advantage of this powerful technology.
“What,” you may be asking, “can I use NLP for?” It turns out that NLP has a tremendous number of uses in the modern enterprise, which remains heavily text driven. Consider the sheer volume of text which flows through your organization on a daily basis, in the form of email, Word documents, PowerPoint slide decks, and instant messages. If we can imbue computers with the ability to “understand” this text, then we can automate workflows, automatically route documents to the users who need them, quickly classify documents for more efficient retrieval at a later date, extract summaries from long documents, and allow computers to answer questions posed in their natural form (e.g., “What is the capital of Canada?”).
Imagine, for example, being able to seamlessly integrate email interactions into your daily workflows, without requiring a person on each end to stop, read and parse the email, and then possibly take some action. With NLP you can have the computer “read” the email, extract relevant instructions and then carry out the other tasks in the workflow, perhaps approving or denying a vacation request or routing a document to a particular individual in your organization.
Using automatic summarization and document classification techniques, you can have the computer pre-read large numbers of documents for you and isolate the ones that are “about” topics relevant to your role. Whether it's “bottom line revenue” or “The Frozgobbit Project”, you can avoid tediously scanning through large groups of documents, looking for the handful that contain relevant information.
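As a toy illustration of the idea (real classifiers are trained statistically, not hard-coded), a naive keyword filter can flag documents that mention enough terms associated with a topic. The term list and threshold here are arbitrary assumptions for the sketch:

```python
# Naive keyword-based document filter -- a stand-in for real statistical
# classification, showing the idea of isolating documents "about" a topic.
def about_topic(document, topic_keywords, threshold=2):
    """Return True if the document mentions at least `threshold` topic keywords."""
    words = document.lower().split()
    hits = sum(1 for w in words if w.strip(".,!?") in topic_keywords)
    return hits >= threshold

docs = [
    "Quarterly revenue is up; the bottom line looks strong this quarter.",
    "Lunch menu for Friday: pizza and salad.",
]
revenue_terms = {"revenue", "bottom", "line", "profit", "earnings"}
flagged = [d for d in docs if about_topic(d, revenue_terms)]
print(flagged)
```

A production system would replace the keyword set with a model trained on labeled examples, but the input/output shape — documents in, relevant documents out — is the same.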
Sentiment Analysis allows you to determine the “tone” or mood of a body of text. You can categorize a customer email as “angry” or “flattering” and make routing decisions or priority assignments based on that classification, or read blogs and determine whether posts in the blogosphere are promoting your organization or deriding it. Another use is categorizing support ticket entries, so that frustrated or angry customers are not overlooked and left to stew in their frustration, perhaps deciding to take that “quick peek” at your competitor's website.
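To make the idea concrete, here is a minimal lexicon-based sentiment scorer. The word lists are tiny, hand-picked assumptions; real sentiment analysis uses trained models, but the flow — text in, tone label out — is the same:

```python
# Toy lexicon-based sentiment classifier: counts positive vs. negative words.
# The lexicons below are illustrative assumptions, not a real sentiment lexicon.
POSITIVE = {"great", "love", "excellent", "thanks", "happy"}
NEGATIVE = {"angry", "terrible", "broken", "refund", "frustrated"}

def sentiment(text):
    """Classify text as 'positive', 'negative', or 'neutral' by word counts."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I am angry: the product arrived broken, I want a refund!"))
```

A support-ticket router could use such a label to bump “negative” tickets to the front of the queue.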
Question and Answer systems also become reality using NLP tools, as you may have gathered from watching the performance of IBM's Watson computer on the television program Jeopardy. Of course, Watson represents the cutting edge of NLP application, both in terms of hardware and software, and is probably financially out of reach for many organizations. Luckily, there is a large body of freely available, Open Source Software for performing NLP. And while your home-built system using a popular OSS NLP toolkit like OpenNLP or NLTK might not win Jeopardy any time soon, it can certainly speed up processing, eliminate waste and reduce inefficiencies within your enterprise.
To begin utilizing Natural Language Processing, you need to understand a few basic operations which are core to NLP. In most NLP applications, the first step is Sentence Detection: reducing the body of text being analyzed into its constituent sentences. This turns out to be a surprisingly challenging problem, as it's not as simple as looking for a punctuation mark, such as a period, and assuming it ends a sentence. Periods appear in abbreviations, for example, and punctuation can occur inside quoted fragments nested within a larger sentence.
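A naive splitter that simply breaks on sentence-final punctuation, with a small guard list of abbreviations, shows why the problem is harder than it looks. The abbreviation list here is a tiny illustrative assumption; real toolkits use trained models instead:

```python
# Naive sentence splitter: break after ".", "!", or "?" unless the token is a
# known abbreviation. A sketch of why sentence detection is non-trivial --
# without the guard list, "Dr." would incorrectly end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived late. He apologized profusely."))
```

Even this small sketch only handles a few cases; quotes, ellipses, and unseen abbreviations are exactly the kinds of ambiguity that push real systems toward statistical models.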
Once discrete sentences have been identified, the text of each sentence is tokenized, or separated into a stream of words, punctuation marks, numbers, etc. Once a sentence has been tokenized, it is necessary to extract “named entities” (people, locations, organizations, etc.) and identify the “part of speech” of the words in the sentence. These operations are called Named Entity Recognition (NER) and Part of Speech Tagging (POS Tagging) respectively. POS Tagging identifies a word as a noun, verb, adverb or other part of speech, and a word's part of speech can vary dramatically based on context. For example, the word “board” might be a noun representing a rectangular slab of wood, or a verb as in “board the train”. The same word can even play different roles in the same sentence, as in “Board up that window with that board over there”.
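A simple regular-expression tokenizer illustrates the tokenization step: it separates words and numbers from the punctuation attached to them. This is a bare-bones sketch; real toolkits handle harder cases such as contractions and hyphenated forms:

```python
import re

# Minimal regex tokenizer: each match is either a run of word characters
# (letters, digits, underscore) or a single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into a stream of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenize('"Board up that window," she said.'))
```

The resulting token stream is what downstream operations such as POS tagging and NER consume.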
Another major operation to perform as part of NLP is “chunking” in which a sentence is reduced to groups of syntactically related words. For example, in the sentence “The quick brown fox jumps over the lazy dog” you would need to identify “the quick brown fox”, “jumps over” and “the lazy dog” as related sections.
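A toy rule-based chunker makes the operation concrete: given tokens that have already been POS-tagged, it groups a determiner plus adjectives plus a noun into a noun-phrase chunk. The tags below are hand-supplied for illustration (in practice they would come from a POS tagger), and the grammar rule is deliberately oversimplified:

```python
# Toy noun-phrase chunker over already POS-tagged tokens.
# Rule (an illustrative assumption): determiners (DT) and adjectives (JJ)
# accumulate until a noun (NN) closes the phrase; any other tag resets it.
def chunk_noun_phrases(tagged):
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DT", "JJ", "NN"):
            current.append(word)
            if tag == "NN":                  # a noun closes the phrase
                chunks.append(" ".join(current))
                current = []
        else:
            current = []                     # non-NP token breaks the phrase
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_noun_phrases(tagged))
```

Running this on the example sentence yields the two noun phrases, “The quick brown fox” and “the lazy dog”; real chunkers learn such patterns from annotated data rather than using a single hand-written rule.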
Modern NLP uses statistical and machine learning algorithms to perform these operations, relying on a model generated from training data fed to the system. Maximum Entropy is one of the most commonly encountered techniques in NLP today; other machine learning algorithms may also be used.
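For a sense of what sits under the hood, here is a bare-bones binary maximum entropy classifier (equivalent, in the binary case, to logistic regression) trained by gradient ascent. The task, features, and data below are toy assumptions purely for illustration, not how any particular toolkit implements it:

```python
import math

# Tiny binary maximum-entropy (logistic regression) classifier trained by
# gradient ascent on the log-likelihood -- a sketch of the statistical
# machinery inside NLP toolkits.
def train_maxent(data, n_features, epochs=500, lr=0.1):
    w = [0.0] * n_features
    for _ in range(epochs):
        for features, label in data:
            z = sum(w[i] * x for i, x in enumerate(features))
            p = 1.0 / (1.0 + math.exp(-z))      # P(label = 1 | features)
            for i, x in enumerate(features):
                w[i] += lr * (label - p) * x    # log-likelihood gradient step
    return w

def predict(w, features):
    z = sum(w[i] * x for i, x in enumerate(features))
    return 1 if z > 0 else 0

# Toy task (an assumption for the sketch): classify a "sentence" as a
# question (1) or statement (0) from two hand-crafted binary features:
# [ends_with_question_mark, starts_with_wh_word].
data = [([1, 1], 1), ([1, 0], 1), ([0, 0], 0), ([0, 1], 0)]
w = train_maxent(data, 2)
print(predict(w, [1, 1]))   # question-like feature vector
```

The point of the sketch is the shape of the approach: features are extracted from text, weights are learned from labeled examples, and the trained weights then score unseen inputs. The toolkits below provide production-quality implementations of exactly this kind of model.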
Fortunately, you do not have to be a statistics guru or machine learning expert to use NLP. A number of pre-built toolkits exist, which include the fundamental algorithms and support needed for sentence detection, tokenization, NER, POS Tagging and other NLP operations. With these toolkits, you simply include a library in your application and use the toolkit API to perform the low level NLP operations. Once you have text which is tagged, chunked and has named entities extracted, you apply your own custom application logic, without needing to worry about the “plumbing”. Better still, the Free / Open Source Software (F/OSS) world is a hotbed of NLP activity, and many of the institutions conducting NLP research release their code under a F/OSS license. As a result, there are several high quality NLP libraries freely available to your organization.
Of the widely known F/OSS NLP toolkits, a few stand out as particularly useful to organizations beginning to use language processing. Apache OpenNLP and Mallet in the Java world, FreeLing and Ellogon for C/C++, and Natural Language Toolkit (NLTK) for Python, are all high quality, well documented, well supported toolkits in this space.
OpenNLP is a Java based NLP toolkit, licensed under the Apache Software License v2 and currently developed as a project within the Apache Software Foundation. OpenNLP supports most of the commonly used NLP operations, including tokenization, sentence segmentation, part of speech tagging and named entity extraction. It also ships with machine learning algorithms, including maximum entropy and perceptrons, to enable the building of systems which learn over time. The project is well documented, has a healthy community around it, and is an excellent starting point for NLP work in Java centric organizations.
Mallet is a Java based toolkit for NLP, and other text processing tasks. Mallet supports basic NLP operations including tokenization, POS tagging and named entity extraction, but also includes more advanced algorithms for document classification, sequence tagging and topic modeling. Mallet is developed under the auspices of the University of Massachusetts at Amherst and is released under the Common Public License.
FreeLing is a C++ based NLP package. FreeLing is developed at the Universitat Politècnica de Catalunya · BarcelonaTech in Catalonia, Spain, and is released under the GNU General Public License (GPL). FreeLing has full support for the major NLP operations, including text tokenization, sentence boundary detection, POS tagging, named entity extraction, parsing and coreference resolution. The package can be used as a library and embedded into your own programs, or used as a standalone program to analyze text files using a command line interface.
Ellogon is a C based language engineering toolkit, released under the GNU Lesser General Public License (LGPL). Ellogon goes beyond basic NLP and is intended as a research tool for computational linguistics as well as a tool which is suitable for end users. The package is described as “a multi-lingual, cross-platform, general-purpose language engineering environment, developed in order to aid both researchers who are doing research in computational linguistics, as well as companies who produce and deliver language engineering systems”.
NLTK is a Python based NLP toolkit, which enables users to build programs to work with human language text. It provides support for standard NLP operations including classification, tokenization, stemming, tagging, parsing and semantic reasoning. NLTK also supports interfaces to over 50 corpora and lexical resources, including WordNet. NLTK is released under the Apache Software License v2 and has a vibrant community around it.
An excellent way to start exploring the world of NLP is by working with the Enron Corpus - a large body of real-world emails which became public domain as a result of being subpoenaed during the Enron trial. Text mining and NLP researchers have found this to be an excellent corpus to work with, as the data is - like most real world data - a bit messy. It reflects the way people communicate in real life, as opposed to being a purely academic tool. The Enron Email Dataset website has the actual data available for download, as well as pointers to articles, research papers, and other researchers working with this corpus.
Download any one of these popular toolkits, grab the Enron corpus and begin exploring. Once you have a feel for the power of NLP, you can move on to building applications around your own data. Once you do so, you will find that NLP applications are a valuable way to leverage technology to assist the members of your organization. You can automate previously error-prone manual steps in critical workflows, classify documentation for faster retrieval and more efficient routing, and quickly mine large bodies of text for essential insights.