Extracting Parallel Phrases from English-Punjabi Corpora
This study presents an approach to extracting parallel data from a comparable English-Punjabi corpus, addressing the scarcity of parallel corpora for this language pair. Unlike previous research, it focuses on producing high-precision parallel data with minimal resources. The data is drawn from several domains, including Wikipedia articles, TDIL's noisy parallel sentences, and Gyan Nidhi reports. The methodology has three phases: extracting and aligning documents, translating the Punjabi texts into English with OpenNMT-py, and measuring content similarity with three measures: Euclidean distance, cosine similarity, and Jaccard similarity. The measures are first run individually, and their results are then integrated to improve accuracy. With the scores of all three measures combined, the system achieves a precision of 93% and an accuracy of 86%. The integrated approach improves parallel data extraction for English-Punjabi corpora and has the potential to improve Statistical Machine Translation (SMT) models.
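The third phase described above, scoring a translated sentence against an English candidate with three measures and then combining them, can be illustrated with a minimal Python sketch. The description does not say how the scores are integrated, so everything here is an assumption for illustration: the bag-of-words representation, the conversion of Euclidean distance to a similarity via 1 / (1 + d), the unweighted mean in the hypothetical `combined_score` function, and all function names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two bag-of-words count vectors."""
    vocab = a.keys() | b.keys()
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in vocab))


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


def combined_score(translated, english):
    """Hypothetical integration step: unweighted mean of three similarities.

    Assumption: the Euclidean distance d is mapped to a similarity
    1 / (1 + d) so that all three values lie in [0, 1]. The book's actual
    combination rule is not stated in this description.
    """
    a = Counter(tokenize(translated))
    b = Counter(tokenize(english))
    euclid_sim = 1.0 / (1.0 + euclidean_distance(a, b))
    return (cosine_similarity(a, b) + jaccard_similarity(a, b) + euclid_sim) / 3.0
```

In a pipeline of this kind, a sentence pair would be kept as parallel only when its combined score clears a chosen threshold; the threshold itself would have to be tuned on held-out data.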
Paperback
$85.00
In Stock
Product Details
ISBN-13: | 9786208225414 |
---|---|
Publisher: | LAP Lambert Academic Publishing |
Publication date: | 10/25/2024 |
Pages: | 204 |
Product dimensions: | 6.00(w) x 9.00(h) x 0.47(d) |