Parallel Web Pages in English and Regional Languages

Language identification of written text is a well-studied research area for languages written in Latin script. New challenges arise, however, when it is applied to languages written in non-Latin scripts, particularly on web pages in Asian languages. Web page classification poses further challenges because the pages are noisy. English has been the predominant language of the World Wide Web since its inception, so the Web's use has largely been confined to people who have a good grasp of English. Because of this linguistic barrier, the Internet has mainly benefited the highly educated part of society. One solution is to provide web pages in regional languages. Our aim is to provide web pages in pairs: a Devanagari page together with its English counterpart, where one exists. To provide parallel web pages in a native language, Hindi or Marathi, on the fly, we first need to classify web pages as Devanagari or English. We experimented on 500 English and Devanagari web pages and were able to label them correctly.

Keywords: Classification of Devanagari Web pages, UTF-8 Encoding.
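
As an illustration of the classification step described in the abstract, the following is a minimal sketch in Python, not the authors' method. It assumes the page is UTF-8 encoded, strips tags with a crude regular expression, and compares the number of characters in the Unicode Devanagari block (U+0900 to U+097F) with the number of basic Latin letters. The function name classify_page, the 0.5 threshold, and the label strings are hypothetical choices made for this example.

```python
import re

# Unicode Devanagari block (used by both Hindi and Marathi).
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
# Basic Latin letters, taken here as a rough proxy for English text.
LATIN = re.compile(r"[A-Za-z]")
# Crude tag stripper; it does not remove the contents of script or style elements.
TAG = re.compile(r"<[^>]+>")


def classify_page(raw_bytes: bytes, threshold: float = 0.5) -> str:
    """Label a page 'devanagari', 'english', or 'unknown' by script share.

    The threshold and the labels are assumptions made for illustration.
    """
    # Assume UTF-8; undecodable bytes are dropped rather than raising.
    text = raw_bytes.decode("utf-8", errors="ignore")
    text = TAG.sub(" ", text)

    devanagari_count = len(DEVANAGARI.findall(text))
    latin_count = len(LATIN.findall(text))
    total = devanagari_count + latin_count

    if total == 0:
        return "unknown"
    if devanagari_count / total >= threshold:
        return "devanagari"
    return "english"


if __name__ == "__main__":
    print(classify_page("<p>नमस्ते दुनिया</p>".encode("utf-8")))
    print(classify_page(b"<html><body>Hello world</body></html>"))
```

A script-level count of this kind cannot separate Hindi from Marathi, since both are written in Devanagari, and it treats any Latin-script text as English; telling those apart would need word-level or n-gram features.
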
1. Introduction

With the explosion of multilingual data on the Internet, the need and demand for an effective automated language identifier for web pages has increased further. Web search in Indian languages is steadily gaining importance. With the fast growth of Indian language content on the web, many