Parallel Web Pages in English and Regional Languages

1236 Words Feb 24th, 2018 5 Pages
However, new challenges arise when it is applied to languages written in non-Latin scripts, particularly to web pages in Asian languages. Web page classification creates new research challenges because of the noisy nature of the pages. English has undoubtedly been the predominant language of the World Wide Web since its inception, so use of the Web is confined to a community of people who have a good grasp of the English language. Because of this linguistic barrier, the Internet's services have mainly benefited a highly educated section of society. The solution to this problem is to provide web pages in regional languages. Our aim is to provide web pages in pairs, one in Devanagari and one in English, where both exist. To provide parallel web pages in a native language such as Hindi or Marathi on the fly, web pages must first be classified as Devanagari or English. We experimented on 500 English and Devanagari web pages and could label them correctly.

Keywords: Classification of Devanagari Web pages, UTF-8 Encoding.
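
The classification step described in the abstract can be illustrated with a minimal sketch. This is not the method used in the paper, which is not detailed here. It assumes the page content is available as UTF-8 bytes and simply compares the number of characters in the Unicode Devanagari block (U+0900 to U+097F) with the number of ASCII letters. The function name `classify_script` and the `threshold` parameter are hypothetical, chosen for illustration only.

```python
def classify_script(raw: bytes, threshold: float = 0.5) -> str:
    """Label UTF-8 page content as 'devanagari', 'english', or 'unknown'.

    Illustrative heuristic only (an assumption, not the paper's method):
    compare counts of Devanagari-block characters and ASCII letters.
    """
    # Decode as UTF-8, skipping any byte sequences that are not valid.
    text = raw.decode("utf-8", errors="ignore")

    # Characters in the Unicode Devanagari block, U+0900 to U+097F.
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")

    # ASCII letters stand in for English text in this sketch.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())

    total = devanagari + latin
    if total == 0:
        # No letters from either script were found.
        return "unknown"

    return "devanagari" if devanagari / total >= threshold else "english"


if __name__ == "__main__":
    print(classify_script("नमस्ते दुनिया".encode("utf-8")))
    print(classify_script(b"Hello world"))
```

A script-level check like this cannot tell Hindi from Marathi, since both are written in Devanagari. A real page would also need its HTML markup stripped first, because tag and attribute names are ASCII and would bias the count toward English.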
1. Introduction

With the explosion of multilingual data on the Internet, the need for an effective automated language identifier for web pages has increased further. Web search in Indian languages is steadily gaining importance. With the fast growth of Indian language content on the web, many…