Abstract—Croatian is a pitch-accent language in which the tone contour realized in the stressed syllable carries lexical information. In some cases, therefore, a different lexical accent gives a word a different meaning. In written texts, where accents are usually not marked, such ambiguity can be resolved by determining the appropriate accent. There are also cases in which various basic and derived word forms have different meanings, different morphosyntactic descriptions (MSDs), and possibly different accents. Words that share the same written form but differ in meaning are called homograms. To resolve the ambiguity of homograms, we created a lexicon of homograms comprising all Croatian nouns of different gender that have the same written form (when accents are not marked) but different meanings, MSDs, and possibly different accents. The lexicon consists of 19,366 entries and 3,460 unique homograms. Each entry comprises the homogram (unaccented word), the accented word, the corresponding MSD, and the accented lemma. The lexicon enables us to identify and disambiguate homograms within a corpus efficiently and accurately. We also evaluated and analyzed the performance of machine translation (MT) systems for the Croatian–English language pair, with special emphasis on homogram translation. We confirmed that disambiguating homograms can improve the performance of MT systems by avoiding major translation mistakes caused by assigning the wrong meaning to a homogram.
Index Terms—Disambiguation of homograms, lexicon of homograms, pitch accent language, word sense disambiguation.
The authors are with the Department of Informatics, University of Rijeka, Croatia (e-mail: firstname.lastname@example.org, email@example.com).
Cite: Lucia Nacinovic Prskalo and Marija Brkic Bakaric, "The Role of Homograms in Machine Translation," International Journal of Machine Learning and Computing, vol. 8, no. 2, pp. 90-97, 2018.
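The four-field lexicon entry described in the abstract (homogram, accented word, MSD, accented lemma) could be represented roughly as follows. This is only a minimal sketch and not the authors' implementation: the names `LexiconEntry`, `build_index`, and `find_homograms` are hypothetical, and the sample entries are placeholders rather than real lexicon data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    homogram: str        # unaccented written form
    accented_word: str   # the same form with its accent marked
    msd: str             # morphosyntactic description tag
    accented_lemma: str  # accented base form


def build_index(entries):
    """Group entries by lowercased unaccented form for fast lookup."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.homogram.lower()].append(entry)
    return dict(index)


def find_homograms(tokens, index):
    """Return (position, token, candidate entries) for each token found in the index."""
    return [
        (i, tok, index[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in index
    ]


# Placeholder entries (assumption): real entries would hold Croatian word
# forms, accent marks, and MSD tags taken from the lexicon itself.
ENTRIES = [
    LexiconEntry("form1", "form1-accent-a", "MSD-A", "lemma-a"),
    LexiconEntry("form1", "form1-accent-b", "MSD-B", "lemma-b"),
    LexiconEntry("form2", "form2-accent-a", "MSD-C", "lemma-c"),
]

if __name__ == "__main__":
    index = build_index(ENTRIES)
    for position, token, candidates in find_homograms(["x", "Form1", "y"], index):
        print(position, token, [c.accented_word for c in candidates])
```

A lookup of this kind only identifies candidate readings for a token. Choosing among them (for example, by matching the MSD assigned by a tagger) is a separate disambiguation step that the sketch does not attempt.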