Master Thesis Defense: Dharani Kumar Palani

July 30, 2018

Speaker: Dharani Kumar Palani

Supervisor: Dr. P. Rigby

Examining Committee: Drs. J. Rilling, R. Witte, T.-H. Chen (Chair)

Title: English to Software Code Statistical Machine Translation

Date: Monday, July 30, 2018

Time: 11:00am

Place: EV 1.162

ABSTRACT

Statistical Machine Translation (SMT) has gained enormous popularity in recent years as natural language translations have become increasingly accurate. In this thesis we apply SMT techniques to the task of translating English descriptions of programming tasks into source code. We evaluate four existing approaches: maximum likelihood word maps, Contextual Expansion, phrase-based translation, and neural network translation. For our training and test (i.e. reference translation) data set, we clean and align content from the popular developer discussion forum StackOverflow.

Our baseline approach, WordMapK, uses a simple maximum likelihood word map model whose output is then ordered using existing code usage graphs. The approach is quite effective, with a precision and recall of 20 and 50, respectively. Adding context to the word map model, Contextual Expansion, increases the precision to 25 with a recall of 40. The traditional phrase-based translation model, Moses, achieves similar precision and recall while also incorporating the context of the input text by mapping English sequences to code sequences. The final approach is neural network translation, OpenNMT. While its median precision is 100, its recall is only 20. Manually examining the output of the neural translation, we find that the code usages are very small and obvious. Our results represent an application of existing natural language strategies in the context of software engineering. We make our scripts, corpus, and reference translations available in the hope that future work will adapt these techniques to further increase the quality of English to code statistical machine translation.
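
The combination of very high precision and low recall reported for the neural approach can be illustrated with a minimal sketch. It assumes a set-based definition of precision and recall over code elements, expressed as percentages; the abstract does not state the exact metric used in the thesis, and the function name and example code elements below are hypothetical.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall as percentages.

    Assumption: a translation is scored by the overlap between the set of
    predicted code elements and the set of reference code elements.
    """
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    precision = 100.0 * overlap / len(predicted) if predicted else 0.0
    recall = 100.0 * overlap / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference translation containing five code elements.
reference = ["File", "FileReader", "BufferedReader", "readLine", "close"]

# A very small output in which every element is correct.
short_output = ["readLine"]

# A longer output that covers more of the reference but includes errors.
long_output = ["File", "FileReader", "readLine", "Scanner", "nextLine", "println"]

print(precision_recall(short_output, reference))  # (100.0, 20.0)
print(precision_recall(long_output, reference))   # (50.0, 60.0)
```

Under this assumed definition, an output that contains only one obvious, correct element scores perfect precision while recovering a small fraction of the reference, which matches the pattern described for the neural translation output.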




© Concordia University