Computers can complete complex mathematical computations in less than a second, but the subtleties of language have long escaped their mechanical minds.
Tavish Pegram ’13 has done his small part to fix this problem.
For his computer science research and thesis project, Pegram created a program that can scan Middle English texts and identify the names of people in them.
He found existing programs, called taggers, that recognize certain types of words in Modern English and modified them specifically for Middle English.
It was a tricky task, according to Pegram, because the major difference between the two is that Middle English, spoken between the late 12th and late 15th centuries, before the printing press was introduced in England, was not standardized.
“[Middle English] doesn’t really care about spelling, word order, punctuation is inconsistent,” said Pegram. “It is a neat poetic feature, but basically makes any kind of computer based analysis extremely difficult.”
Without a consistent grammar, Pegram, who began work last September under the stewardship of his adviser, Professor of Computer Science Nancy Ide, struggled to design general rules for computers to read Middle English and pick out people’s names.
Pegram’s thesis combines two long-held academic passions.
He is a Computer Science and English double major with a correlate in Mathematics. He had known he wanted to be a double major before coming to college, and the decision, he hopes, will give him an edge after college too.
“I think it is helpful for at least being employable, because Computer Science majors are notoriously bad at communicating, and [an] English [B.A.], even if it isn’t true at least implies a certain ability,” said Pegram.
After graduation, Pegram will be working for the medical software company Epic in Madison, Wis. Eventually, Pegram said, he aspires to work with computational linguistics.
Computational linguistics is the field that studies language through computers. The iPhone’s Siri, which can interpret certain requests and reply from an index of responses, and Google Translate, which can translate snippets of text from one language into another, are examples of computational linguistics used in everyday life.
However, these achievements are still nowhere close to matching a human’s capacity for reading, speaking and understanding language. The average student might encounter these difficulties when asking a phone for directions or seeking help from Google Translate for a Spanish paper.
Pegram’s taggers, for their part, kept tripping over the messiness of Middle English, where the spelling and punctuation of a name can vary even within a single text. The two works Pegram tested his program on were Havelok the Dane and King Horn, two chivalric romances from the 13th century.
Spelling divergences are common. The name “Havelok” can also be spelled “Haveloc,” “Wiliam” is also known as “Willam,” and there are no fewer than five different ways to spell “Fikenild,” the name of the treacherous villain in King Horn.
Pegram wrote a list, or gazetteer, of nearly 100 names and their alternative spellings for the program to recognize as people’s names in the two texts he worked with.
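A gazetteer lookup of this kind can be sketched in a few lines of Python. This is only an illustration, not Pegram’s actual program: the dictionary holds just the spellings quoted above, the names `GAZETTEER`, `VARIANT_TO_NAME` and `tag_names` are hypothetical, and the sample line is invented rather than taken from either romance.

```python
import re

# Hypothetical gazetteer: canonical name -> spellings seen in the texts.
# Only the variants quoted in the article are listed here; the real list
# is described as holding nearly 100 names.
GAZETTEER = {
    "Havelok": {"Havelok", "Haveloc"},
    "Wiliam": {"Wiliam", "Willam"},
}

# Invert the gazetteer so each lowercased spelling points to its canonical name.
# Lowercasing is an assumption made here because capitalization is unreliable.
VARIANT_TO_NAME = {
    variant.lower(): canonical
    for canonical, variants in GAZETTEER.items()
    for variant in variants
}


def tag_names(line):
    """Return (token, canonical_name) pairs for every token found in the gazetteer."""
    tokens = re.findall(r"[^\W\d_]+", line)  # runs of letters, any alphabet
    return [
        (token, VARIANT_TO_NAME[token.lower()])
        for token in tokens
        if token.lower() in VARIANT_TO_NAME
    ]


# Invented sample line, not a quotation from Havelok the Dane or King Horn.
print(tag_names("haveloc and Willam rode out"))
```

A lookup like this only recognizes spellings someone has already listed, which is why every new variant has to be added by hand.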
Middle English cannot even agree on the spelling of one famous name. References to God are common in the texts Pegram was using, and so he had to be certain that his taggers could recognize God as a name.
To make his work easier, Pegram had decided at the start of his thesis to train his taggers only on the names of people and to ignore other proper nouns. But was God closer to a person, or to something like a country or an organization? What was God?
On this unexpected question raised in a computer science thesis, Pegram said, “It is ambiguous…I’m including God as a named identity, as a person.”
With that mystery, and potential moral dilemma, solved, the next step was finding a set of rules that marked God out as a name. However, this was Middle English and neither spelling nor capitalization could be relied on.
The word for “good” was once spelled “god,” so Pegram’s tagger would sometimes confuse the word for agreeable or pleasant with the name of the supreme being of the universe. The adjective “god” was also sometimes capitalized as “God,” and the name “God” could sometimes be spelled G-O-D-E: “Gode.”
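The ambiguity can be made concrete with a small Python sketch. It is an illustration built only from the spellings mentioned above, not a description of how Pegram’s tagger resolves the problem, and the names `READINGS`, `possible_readings` and `is_ambiguous` are hypothetical.

```python
# Hypothetical table of surface spellings and the readings the article
# attributes to them. Keys are lowercased because capitalization does not
# separate the adjective from the name.
READINGS = {
    "god": {"good (adjective)", "God (name)"},
    "gode": {"God (name)"},
}


def possible_readings(token):
    """Return every reading recorded for a token, ignoring capitalization."""
    return READINGS.get(token.lower(), set())


def is_ambiguous(token):
    """A token is ambiguous when spelling alone leaves more than one reading."""
    return len(possible_readings(token)) > 1


for token in ["god", "God", "Gode"]:
    print(token, sorted(possible_readings(token)), is_ambiguous(token))
```

Because “god” and “God” both map to two readings, a tagger needs something beyond the word itself, such as the surrounding words, to decide between them.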
Working through challenges with the grammar and the software took trial and error and required both patience and diligence. Sometimes, he said, it required him to take a step back from his work for a while.
Said Pegram, “One thing about computer science is that nothing ever works the first time or the second time or usually the third or fourth times and it usually doesn’t work until you get really frustrated and leave and then come back a week later and it works by itself.”
Pegram believes his thesis could be useful for students like himself who are tracing the evolution of the English language.
Pegram said, “It’s helpful for teaching and linguistic research. If you have the tags in place it makes it easier to interact with the text [and] make it a more helpful teaching tool.”
When asked if his taggers could work on Geoffrey Chaucer’s Canterbury Tales, the most famous work of literature in Middle English, Pegram shrugged. “It should be able to.”