Natural Language Processing, Part Three: At Northwestern

Posted April 26, 2019 by Philip Burns

This is the third of four articles about Natural Language Processing. The first article introduced basic concepts and fundamental methods; the second discussed advanced methods in NLP.

In Northwestern’s research computing divisions we have developed and adapted a large amount of NLP software over the past 20 years, primarily to support literary research and learning analytics. Other departments at Northwestern have done the same. Our peer institutions have also developed excellent NLP software with which we are familiar.

Recently there has been an explosion in the availability of commercial NLP software, much of which builds upon work originally produced at universities. “Cloud” services for NLP are available from IBM, Amazon, Microsoft, and Google, among others.

But work on projects incorporating Natural Language Processing at Northwestern goes back at least to the mid-1990s. The Chicago Homer dates from this time period.

From the web site: “The Chicago Homer is a multilingual database that uses the search and display capabilities of electronic texts to make the distinctive features of Early Greek epic accessible to readers with and without Greek. Except for fragments, it contains all the texts of these poems in the original Greek. In addition, the Chicago Homer includes English and German translations, in particular Lattimore’s Iliad, James Huddleston’s Odyssey, Daryl Hine’s translations of Hesiod and the Homeric Hymns, and the German translations of the Iliad and Odyssey by Johann Heinrich Voss. Through the associated web site Eumaios, users of the Chicago Homer can also access, from each line of the poem, pertinent Iliad Scholia and papyrus readings.”

The Editors of the Chicago Homer were Ahuvia Kahane, formerly of Northwestern’s Department of Classics, and Martin Mueller, Departments of English and Classics, Northwestern University. Technical editors were Bill Parod and Craig Berry.

Several projects on which I have worked with Martin Mueller have concentrated on applying NLP methods to literary texts, especially Early Modern English texts (~1475 to 1700). This corpus of texts, digitized in a reasonably consistent format, with light NLP processing added, amounts to a first-stage “Book of English.” The availability of the enhanced texts facilitates the work of literary scholars and students. Projects in this series have included VOSPOS, WordHoard, Monk, MorphAdorner, and the currently ongoing EarlyPrint. Except for VOSPOS, these projects received generous financial assistance in the form of grants from the Mellon Foundation. VOSPOS was funded by grants from ProQuest and the CIC universities.

VOSPOS sought ways to automate the modernization and standardization of spellings in older English texts printed before spelling had become regularized. Martin Mueller was the editor, and Jeff Cousens and Philip R. Burns did the programming, with help from dedicated student assistants. All of the methods from VOSPOS were eventually incorporated into MorphAdorner.
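The details of VOSPOS’s algorithms are beyond the scope of this article, but the basic idea of spelling standardization is easy to sketch. The toy Python below is illustrative only (it is not VOSPOS’s actual method, and the variant table and word list are invented): it combines a lookup table of known variants with fuzzy matching against a modern lexicon.

```python
# Illustrative sketch of old-spelling standardization -- not VOSPOS's
# actual algorithm. The variant table and lexicon are invented samples.
import difflib

VARIANT_MAP = {"vertue": "virtue", "dayes": "days", "onely": "only"}
MODERN_LEXICON = ["virtue", "days", "only", "love", "heart", "have"]

def standardize(word):
    """Return a modernized spelling for an Early Modern English word."""
    lower = word.lower()
    if lower in VARIANT_MAP:  # known variant: direct lookup
        return VARIANT_MAP[lower]
    # Fall back to the closest modern lexicon entry; fuzzy matching absorbs
    # period conventions such as u/v interchange ("loue" -> "love").
    matches = difflib.get_close_matches(lower, MODERN_LEXICON, n=1, cutoff=0.7)
    return matches[0] if matches else word

print(standardize("Vertue"))  # -> virtue
print(standardize("loue"))    # -> love
print(standardize("zephyr"))  # unknown word passes through unchanged
```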

WordHoard is an application for the close reading and scholarly analysis of deeply tagged texts. Martin Mueller was the faculty sponsor for WordHoard. Developers included Bill Parod, Jeff Cousens, Philip R. Burns, John Norstad, and Craig A. Berry. WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer, Shakespeare, and Spenser, along with 321 Early Modern English plays by other authors.

MorphAdorner performs a variety of linguistic adornment tasks on texts. It can perform basic NLP tasks as well as many of the advanced tasks described in Part Two of this series. MorphAdorner was written by Philip R. Burns to support the WordHoard, Monk, and EarlyPrint projects.
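MorphAdorner itself is a Java toolkit with its own trained models, but readers who want a feel for what “adornment” means can get one from the unrelated open-source spaCy library, which attaches comparable annotations (tokens, parts of speech, lemmas) to a text:

```python
# A rough analogue of linguistic adornment using spaCy -- not MorphAdorner,
# which is a Java toolkit with models tuned for older English.
import spacy

nlp = spacy.load("en_core_web_sm")  # small general-purpose English pipeline
doc = nlp("Love is not love which alters when it alteration finds.")

for token in doc:
    # Each token is "adorned" with a part of speech and a lemma.
    print(f"{token.text:12} {token.pos_:6} {token.lemma_}")
```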

Monk (“Metadata Offer New Knowledge”) sought to combine the work we did in WordHoard with machine-learning-based methodology from the Nora project at the University of Illinois. The principals were Martin Mueller and John Unsworth, supported by a large cast of worker bees, including myself. This multi-university project also included work at the University of Nebraska on regularizing texts encoded in TEI (Text Encoding Initiative) format, as well as a variety of new analytics displays from the University of Maryland and several Canadian universities. Monk can be considered the ancestor of many later projects, including HathiTrust and EarlyPrint.

EarlyPrint aims to create a deduplicated digital library of most English books published before 1700. From the web site: “This library will be freely accessible to the public, and each item in it should be a complex digital surrogate including:

  1. A transcription that strikes a balance between being faithful to the printed source while being easy to read and use on the laptops and other mobile devices that are the ‘tables of memory’ on which 21st-century scholars do much of their reading and writing.
  2. Good quality page images that provide the witnesses to check the transcriptions and offer modern readers a sense of the materiality of the text in its original embodiment.
  3. Bibliographical, structural, and linguistic metadata that can be used separately or in conjunction to explore particular texts and support forms of “distant” or “scalable” reading across the entire corpus or parts of it.

A fourth and critical feature of this library is a framework that supports collaborative curation and allows users to offer corrections of the most common forms of textual corruption in the transcriptions. A working version of such a framework is implemented on this site, but it will need improvements over time.” At the time of writing, EarlyPrint contains about 30,000 texts from the TCP EEBO and Evans collections.

The editors of EarlyPrint include Martin Mueller, Craig A. Berry, Philip R. Burns, Elisabeth Chaghafi, Joe Loewenstein, and Kate Needham.

The Pulter Project is a collaboration between Wendy Wall of Northwestern’s English Department and Weinberg College’s Media and Design Studio. Hester Pulter’s book of poetry sat ignored for over 300 years. In 2014 British scholar Alice Eardley published a print edition of the book, which rapidly went out of print. Luckily, Wendy Wall, along with Leah Knight, associate professor of English at Canada’s Brock University, worked with Josh Honn of Northwestern’s Library and the Media and Design Studio (notably Matt Taylor, Sergei Kalugin, and Chad Davis) to produce a scholarly online edition of Pulter’s work.

Postdoctoral Fellow Martin Gerlach in Chemical and Biological Engineering, along with his collaborators Tiago Peixoto (University of Bath) and Eduardo Altmann (University of Sydney), applied methods from community detection in networks to improve topic modeling. Their approach automatically determines the number of topics and hierarchically clusters both the words and the documents.
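Their published method starts by recasting a corpus as a bipartite network: documents form one kind of node, words the other, and edge weights count occurrences. Here is a minimal sketch of that data structure using networkx and an invented three-document corpus (the community detection step itself, a hierarchical stochastic block model in the published work, is not shown):

```python
# Build the bipartite word-document network underlying the network approach
# to topic modeling. The toy corpus is invented for illustration.
import networkx as nx

docs = {
    "doc1": "the iliad sings of wrath and war",
    "doc2": "the odyssey tells of a long voyage home",
    "doc3": "war and voyage fill the epic poems",
}

g = nx.Graph()
for doc_id, text in docs.items():
    g.add_node(doc_id, kind="document")
    for word in text.split():
        g.add_node(word, kind="word")
        if g.has_edge(doc_id, word):
            g[doc_id][word]["weight"] += 1  # count repeat occurrences
        else:
            g.add_edge(doc_id, word, weight=1)

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```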

Several Learning Analytics Tools developed by members of the Teaching and Learning Technologies group at Northwestern incorporate NLP methods. The Discussion Analytics Tool, written by Jacob Collins, analyzes discussions in Northwestern’s Canvas Learning Management System. NLP tools extract keywords and named entities such as organizations, place names, and persons from forum posts. The tool can also compute the reading level of each post, expressed as a grade level, and measure the overall sentiment of each post as positive, negative, or neutral. The tool outputs this information in a single downloadable comma-separated values (CSV) file, which can be opened in other tools such as Microsoft Excel for further analysis and visualization.
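As a hedged sketch of what such a per-post pipeline can look like, the Python below uses stand-in open-source libraries (spaCy for named entities, textstat for the grade level, NLTK’s VADER for sentiment); it is not the Discussion Analytics Tool’s actual implementation, and the sample posts are invented:

```python
# Sketch of a per-post analysis pipeline with stand-in libraries; this is
# not the Discussion Analytics Tool's actual implementation.
import csv

import spacy
import textstat
from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()

posts = ["Evanston is lovely in the spring.", "This assignment was confusing."]

with open("discussion_analytics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["post", "entities", "grade_level", "sentiment"])
    for post in posts:
        doc = nlp(post)
        entities = "; ".join(f"{e.text} ({e.label_})" for e in doc.ents)
        grade = textstat.flesch_kincaid_grade(post)
        score = sia.polarity_scores(post)["compound"]
        label = ("positive" if score > 0.05
                 else "negative" if score < -0.05 else "neutral")
        writer.writerow([post, entities, grade, label])
```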

The Yellowdig Viz tool, written by Patricia Goldweic, visualizes class interactions in Yellowdig discussion boards as a network graph. Named entities and keywords extracted from the interactions can be “pinned” to focus attention on them, and the display can be filtered by time.
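For a rough idea of the underlying data structure (again illustrative, not Yellowdig Viz’s implementation), class interactions can be modeled as a directed graph whose edges carry timestamps, which reduces filtering by time to selecting edges:

```python
# Illustrative interaction graph with timestamped edges -- not Yellowdig
# Viz's implementation. Names and dates are invented.
from datetime import datetime

import networkx as nx

g = nx.MultiDiGraph()  # one edge per reply; repeated pairs are allowed
replies = [  # (author, replied_to, when)
    ("alice", "bob", datetime(2019, 4, 1)),
    ("carol", "alice", datetime(2019, 4, 8)),
    ("bob", "carol", datetime(2019, 4, 15)),
]
for author, target, when in replies:
    g.add_edge(author, target, time=when)

# Time filter: keep only interactions before April 8.
cutoff = datetime(2019, 4, 8)
early = [(u, v) for u, v, d in g.edges(data=True) if d["time"] < cutoff]
print(early)  # -> [('alice', 'bob')]
```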

The Canvas Support Chatbot, also written by Patricia Goldweic, uses IBM Watson NLP technology. It assists in answering frequently asked questions about Northwestern’s Canvas learning management system. While the chatbot specializes in helping with enrollment-related questions, it can also help with other Northwestern-specific Canvas issues.
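A round trip to Watson Assistant from Python looks roughly like the sketch below, which uses IBM’s ibm-watson SDK; the API key, service URL, and assistant ID are placeholders, and the chatbot’s actual integration with Canvas is not shown:

```python
# Minimal Watson Assistant round trip via IBM's ibm-watson Python SDK.
# Credentials and the assistant ID are placeholders; this is not the
# Canvas Support Chatbot's actual code.
from ibm_watson import AssistantV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV2(
    version="2019-02-28",
    authenticator=IAMAuthenticator("YOUR_API_KEY"),
)
assistant.set_service_url("https://api.us-south.assistant.watson.cloud.ibm.com")

# Conversations take place within a session.
session = assistant.create_session(assistant_id="YOUR_ASSISTANT_ID").get_result()

response = assistant.message(
    assistant_id="YOUR_ASSISTANT_ID",
    session_id=session["session_id"],
    input={"message_type": "text", "text": "How do I add a TA to my course?"},
).get_result()

# Text replies arrive in the "generic" output list.
for item in response["output"]["generic"]:
    if item["response_type"] == "text":
        print(item["text"])
```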

University-based research and development of NLP tools should not be thought of as important only to scholars. NLP capabilities are increasingly important in the classroom, as the learning analytics tools described above demonstrate. NLP tools are also increasingly used in Northwestern’s administrative and accounting offices.

As an example, Rich Gordon of Northwestern’s Journalism Department and Larry Birnbaum of the Department of Computer Science have taught courses pairing students in journalism with students in computer science. The students work in groups to discover novel ways to marry computer science technology with traditional journalism approaches to data gathering, analysis, and reporting. Additional support is provided by Joe Germuska of Northwestern’s Knight Lab, which specializes in developing tools and methods for news media. Unsurprisingly, many of the student projects involve NLP methods. What was once mainly the province of computer science researchers is rapidly becoming the daily bread and butter of working journalists.

In years to come we can expect NLP to be integrated at all levels of the university curriculum. As Martin Mueller has suggested, a lot of marketable skills can be acquired from “reading” using eyes and NLP tools together.