Title Mining a parallel corpus for use in multilingual machine translation


The word ‘translation’ refers to the transformation of text from one language into another. Machine translation is the automatic translation of text by computer from one natural language into another. One of the newer approaches to machine translation is the statistical approach. Statistical machine translation generates translations using statistical methods trained on large parallel bilingual corpora of the source and target languages. The translation model is built from the alignment of words or sequences of words between the two languages. The quality of training depends on the quality and quantity of the parallel training corpus, which is a particular concern for under-resourced languages.

Methods for mining parallel data from text have been actively researched in recent years. In this topic, the student is required to study the existing mining methods and to propose a mining method that can be applied to multiple languages.

Work:
-   Study the theory of mining methods for extracting parallel text corpora
-   Propose a mining method that can be applied to multiple languages
-   Research and develop the proposed method to build a parallel corpus of 3 or 4 languages (priority given to Vietnamese, Khmer, Thai, ...)
-   Propose possible resources from which a large corpus can be collected
-   Deploy the mining method on the proposed resources to build a large parallel corpus of 3 or 4 languages
-   Verify the efficiency of the proposed method
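
To give an idea of what a mining method can look like, here is a minimal Python sketch of one simple heuristic: a candidate sentence pair is kept when the sentence lengths are comparable and enough source words have a known translation in the target sentence. This is only an illustration, not the method to be proposed in this topic. The function names (pair_score, mine_pairs), the thresholds and the toy lexicon are all hypothetical. Whitespace tokenization is also an assumption: Thai and Khmer do not separate words with spaces, so a word segmenter would be needed for those languages.

```python
from typing import Dict, List, Set, Tuple


def pair_score(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens that have a lexicon translation in the target."""
    src_tokens = src.lower().split()  # assumption: whitespace tokenization
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def mine_pairs(
    src_sents: List[str],
    tgt_sents: List[str],
    lexicon: Dict[str, Set[str]],
    min_score: float = 0.5,      # hypothetical threshold
    max_len_ratio: float = 2.0,  # hypothetical threshold
) -> List[Tuple[str, str, float]]:
    """For each source sentence, keep its best-scoring target sentence, if good enough."""
    pairs = []
    for s in src_sents:
        best, best_score = None, 0.0
        for t in tgt_sents:
            ls, lt = len(s.split()), len(t.split())
            if ls == 0 or lt == 0:
                continue
            # Cheap filter: discard pairs whose lengths differ too much.
            if max(ls, lt) / min(ls, lt) > max_len_ratio:
                continue
            score = pair_score(s, t, lexicon)
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= min_score:
            pairs.append((s, best, best_score))
    return pairs


# Toy usage with an invented English-French lexicon.
toy_lexicon = {"the": {"le"}, "cat": {"chat"}, "sleeps": {"dort"}}
print(mine_pairs(["the cat sleeps"], ["le chat dort", "il pleut beaucoup"], toy_lexicon))
```

The exhaustive comparison of every source sentence with every target sentence is quadratic and is used here only for clarity. A large corpus would need a cheaper way to select candidates, for example comparing only sentences that come from documents already known to correspond.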

Student prerequisites This subject is open to Vietnamese as well as foreign students. Preference is given to students with a fairly good knowledge of text processing and programming.
   
Supervisors/contacts DO Thi Ngoc Diep, Researcher/Ph.D, email: ngoc-diep.do at mica.edu.vn