As criminals increasingly resort to email to send threats, scams, and ransom notes, researchers at the Concordia Institute for Information Systems Engineering (CIISE) have developed a method that can link an author to their own unique “write-print” in every email they send.
Given an email with a questionable source and a set of potential authors, the researchers’ method analyzes frequent patterns in the suspects’ previous writing to determine the email’s author with 80 to 90 per cent confidence. This frequent pattern analysis is distinct from other computerized authorship attribution in that it boasts high accuracy but is still easy to explain to a courtroom. “With our approach, we are not a black box,” said Benjamin Fung, professor and data-mining expert at the CIISE, in an interview with The Daily. “We can explain how we derived conclusions.”
The analysis method, a collaborative effort between Fung, Concordia cyber forensics expert Mourad Debbabi, and PhD student Farkhund Iqbal, mines an author’s past emails for bundles of features to determine their personal write-print. The unique features of a particular write-print range from the richness of the author’s vocabulary, to the presence and structure of their salutations. The program can then compare a set of suspects’ write-prints to the email in question and determine who is most likely to have authored it.
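To make the idea concrete, here is a minimal sketch in Python of how a frequent-pattern write-print could work. It is an illustration under loud assumptions, not the researchers’ program: the four style features, the 60 per cent support threshold, and every function name are hypothetical stand-ins for the hundreds of attributes the team describes.

```python
import re
from itertools import combinations

# Hypothetical word lists; a real system would use far richer features.
GREETINGS = ("hi", "hello", "dear", "hey")
SIGN_OFFS = ("regards", "thanks", "cheers")


def features(email: str) -> frozenset:
    """Reduce one email to a set of coarse style items (assumed features)."""
    words = re.findall(r"[a-z']+", email.lower())
    items = set()
    if words and words[0] in GREETINGS:
        items.add("greeting")
    if words and words[-1] in SIGN_OFFS:
        items.add("sign_off")
    if "!" in email:
        items.add("exclaims")
    if len(words) < 12:
        items.add("short")
    return frozenset(items)


def frequent_patterns(emails, min_support=0.6, max_size=2):
    """Return item combinations found in at least min_support of the emails."""
    sets = [features(e) for e in emails]
    all_items = sorted(set().union(*sets))
    patterns = set()
    for size in range(1, max_size + 1):
        for combo in combinations(all_items, size):
            support = sum(set(combo) <= s for s in sets) / len(sets)
            if support >= min_support:
                patterns.add(frozenset(combo))
    return patterns


def write_prints(corpus, **kw):
    """Keep, for each suspect, only the patterns no other suspect shares."""
    freq = {name: frequent_patterns(emails, **kw) for name, emails in corpus.items()}
    return {
        name: pats - set().union(*(p for other, p in freq.items() if other != name))
        for name, pats in freq.items()
    }


def most_likely_author(email, prints):
    """Score each suspect by how many of their patterns the email contains."""
    items = features(email)
    scores = {name: sum(p <= items for p in pats) for name, pats in prints.items()}
    return max(scores, key=scores.get), scores


# Made-up sample emails for two hypothetical suspects.
corpus = {
    "alice": ["Hi Bob, see you at noon!", "Hello team, great work today!", "Hi all, lunch is on me!"],
    "bob": ["Report attached. Regards", "Meeting moved to three. Thanks", "Invoice sent this morning. Regards"],
}
prints = write_prints(corpus)
print(most_likely_author("Hey Sam, your parcel arrives tomorrow!", prints))
```

Because each write-print is just a list of named patterns, the sketch can show exactly which ones matched, which mirrors the explainability Fung describes.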
Fung freely admitted that there are ways around one’s own write-print, especially because the program can show exactly which stylistic techniques compose a given write-print. However, much like trying to disguise one’s handwriting, it would be difficult to resist falling back into familiar patterns, especially since the program looks for hundreds of bundles of attributes.
Additionally, some attributes are tough to disguise. “For example, vocabulary richness is very difficult to fake,” Fung said. “It’s very difficult for me to suddenly increase my vocabulary.”
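The article does not say how the team measures vocabulary richness. As a rough illustration only, the Python sketch below computes two standard stylometric measures: the type-token ratio (distinct words divided by total words) and the share of words used exactly once. The function name and the tokenizing rule are assumptions.

```python
import re
from collections import Counter


def vocabulary_richness(text: str) -> dict:
    """Compute two common richness measures for a piece of text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {"type_token_ratio": 0.0, "hapax_ratio": 0.0}
    counts = Counter(words)
    hapaxes = sum(1 for c in counts.values() if c == 1)  # words used once
    return {
        "type_token_ratio": len(counts) / total,
        "hapax_ratio": hapaxes / total,
    }


print(vocabulary_richness("the cat saw the dog and the dog saw the cat"))
```

Both ratios tend to fall as a text gets longer, so they are only comparable across messages of similar length.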
Fung and his colleagues developed the method with law enforcement applications in mind. Since most successful authorship-attribution techniques are tailored toward literary plagiarism, the researchers had to develop a method that could glean a write-print from smaller and less formal emails.
Even in combing past emails to determine a write-print, Fung and Debbabi recognize that the method should protect the suspect’s privacy. Much of Fung’s past work focuses on balancing the valuable conclusions that data mining can draw with the individual privacy issues it may raise. “One research direction that we want to work on is how to…perform write-print analysis without compromising privacy.”
With social trends moving toward “micro-communicating” via tweets, comments, and text messages, Fung is not sure how much further authorship attribution can be ported.
“Whenever we port [the techniques] from one style of writing to another style – like traditional writing to email – we need to add stylometric features to capture the new writing styles,” Fung said. “[With] shorter messages, like SMS or Twitter…it’s definitely a challenge. Maybe there’s no solution.”
The team’s authorship attribution techniques for email, however, proved highly accurate when tested on the Enron database, which contained over 200,000 real emails authored by 158 different employees. Following the program’s success, the CIISE began to field calls from around the world asking for help tracing anonymous emails.
“Some private investigators [call with] real cases, some of the investigators just want to test the cases they have already solved,” Fung said.
Victims of email harassment also contacted Fung for help. “I received many emails from many different people from different countries, saying, ‘Somebody is sending me a threatening email, would you please help me to solve this?’” Fung said. “They just read the newspaper title and send me an email.”
So far, there has been no concrete collaboration between the team and law enforcement or private individuals. The team is currently focusing on applying their methods in a second scenario: using an email to infer certain characteristics about the author – such as nationality or education – even when there aren’t any suspects.
Fung stressed that this procedure would be significantly less accurate, though “it would be useful for the early stages of investigation,” he said, “where the investigator has very little clues.”