It’s not often that a third-year undergraduate has a research paper accepted at a major conference. When it happened for Leah Jo earlier this year, she and her co-authors were elated.
The UBC computer science undergraduate was in her third year and taking a few linguistics electives. One of her linguistics instructors, Dr. Jungyeul Park, invited Leah to join a Natural Language Processing (NLP) research project that was already underway, and she jumped at the chance.
The result was this paper: Yet Another Format of Universal Dependencies for Korean. Chen, Y.*, Jo, E. L.*, Yao, Y.*, Lim, K., Silfverberg, M., Tyers, F. M., & Park, J. (2022). *Authors contributed equally.
The paper was presented at the 29th International Conference on Computational Linguistics (COLING) in October.
Dr. Park had been working on a morpheme-based scheme for dependency parsing in languages other than English, including Korean. The research team was already halfway through the project when they decided Leah Jo could lend valuable help.
A morpheme is the smallest meaningful unit of a language. For example, the word “dog” is a morpheme. The “-s” at the end of “dogs” is also a morpheme.
Because Leah is fluent in Korean and is studying computer science in addition to linguistics, Dr. Park believed she could bring a valuable combination of insights to the research. And indeed, she did.
“My role was helping with error analysis on the research results. I ran programs to test for errors and uncover the cause,” Leah said. She explained that the team was then able to develop a new scheme that outperformed the existing one.
Helping voice agents sound natural, in any language
One real-world application of this type of research is voice agents. Training voice agents (like Siri or Alexa) in languages other than English is a more complicated and nuanced process, because some languages have rich morphology, in which a single word can combine several morphemes, or indirect parsed dependencies. Korean has an extremely rich morphology.
Training machine agents to get these nuances right requires a representation that closely matches the language, and that depends heavily on the type of work explored in this UBC paper.
The paper proposes a new annotation scheme for Universal Dependencies for Korean. (Universal Dependencies is an international cooperative project to create parsed text corpora that annotate the sentence structures of the world's languages.) The morpheme-based format the team developed follows the linguistic properties of Korean and improves parsing performance.
An international collaboration
“It was such an interesting experience to be a part of the research,” Leah said. “It was an international effort, so I was working with people from all over the world on different time zones. I am very happy and excited that the paper was accepted.”
She also expressed what it meant to contribute in her mother tongue. “The fact that I could contribute to the research using my knowledge directly related to my heritage was very meaningful.”
Dr. Park added, “Leah Jo’s analytic mind in computational thinking plays one of the most important roles in the paper, where she contributed the error analysis in qualitative and quantitative ways. Working with Leah was an absolute pleasure for our entire team.”
Research at the intersection of computer science, linguistics and Korean continues to advance, thanks to people like Leah and her co-authors.