Crowdsourcing the Development of Underserved Language Resources

November 17, 2011 - 15:26 -- anas

 

Affordable, accessible and sustainable data, tools and technologies in local languages are necessary if populations across the developing world are to take part in the knowledge society and economy, both consuming and generating relevant content. This includes access to appropriate networks and Information and Communication Technologies (ICTs) supported by adequate Human Language Technologies (HLT). There is an urgent need to realize the fundamental right of the world's citizens to access information in their own language: information that will allow them to improve their economic situation, their education, their legal rights, and their health. A major challenge still facing the development of a truly inclusive and diverse global information society is the extreme scarcity of language resources that researchers and practitioners can use to build HLT for countries in the developing world. Unless this is resolved, the vast majority of the world's next billions of citizens, who rely exclusively on their native languages to consume and produce information, will be unable to participate in the global information society.

 

This project aims to tackle this challenge by leveraging open content, mobile technologies and crowd-sourcing to create language resources for the world's underserved languages and to make them available under open licenses, stimulating research and development in the area of Human Language Technologies (HLT). The project will use existing open text repositories (such as Wikipedia) in languages such as Swahili, Arabic and Urdu, and will create a crowd-sourcing mechanism for developing these text repositories into language corpora. This could include, for example, tagging the words in the corpus by part of speech (a process known as Part of Speech Tagging). For this purpose, a platform can be built to extract sentences from the corpus and send them to a group of contributors through text messages. Each contributor can examine a sentence, determine the tag for each word in it (verb, noun, adjective, etc.) and send the result back to the platform. Redundant responses from several contributors will be used to check the correctness of the answers and to flag potential errors. Participation in the platform can be encouraged through several means. For example, contributors may be rewarded with mobile credit they can use on their phones, or a badge system could acknowledge active contributors. Participation could also be structured as a game.
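
The redundancy check described above could be sketched as a per-word majority vote over the tag sequences returned for one sentence. This is only an illustration under stated assumptions, not the project's actual design: the function name `aggregate_tags`, the 0.6 agreement threshold and the tag abbreviations are all hypothetical.

```python
from collections import Counter


def aggregate_tags(responses, min_agreement=0.6):
    """Combine redundant tag sequences for one sentence by majority vote.

    responses: list of tag lists, one per contributor, all of equal length.
    min_agreement: assumed threshold; words whose winning tag has a smaller
        share of the votes are flagged for review.
    Returns (tags, flagged), where flagged lists the word positions that
    fell below the threshold.
    """
    if not responses:
        raise ValueError("at least one response is required")
    length = len(responses[0])
    if any(len(r) != length for r in responses):
        raise ValueError("all responses must tag the same number of words")

    tags, flagged = [], []
    for i in range(length):
        # Count the tags that contributors proposed for word i.
        votes = Counter(r[i] for r in responses)
        tag, count = votes.most_common(1)[0]
        tags.append(tag)
        if count / len(responses) < min_agreement:
            flagged.append(i)
    return tags, flagged
```

Calling `aggregate_tags` with three contributors' answers returns the winning tag per word plus the positions where agreement was weak, which could then be sent out to more contributors.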

 

 

User Stories: 

 


  • A research team uses the platform to upload a corpus of text in their local language that needs to be tagged by the mobile crowd.
  • A mobile user receives a text message that contains a sentence that needs tagging. The user adds the appropriate abbreviation to each word to identify its part of speech and sends the message back to the platform.
  • The platform collects and validates all contributions, constructs the tagged corpus and sends it back to the researchers.
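
The second and third stories leave the reply format open. As a minimal sketch, assume each word in the reply carries its abbreviation after a slash (`word/tag`); that format, the function name `parse_reply` and the tag set below are assumptions made for illustration only.

```python
# Hypothetical tag abbreviations; a real deployment would define its own set.
ALLOWED_TAGS = frozenset({"n", "v", "adj", "adv", "pro", "p", "det"})


def parse_reply(sentence, reply, allowed_tags=ALLOWED_TAGS):
    """Validate an SMS reply against the sentence that was sent out.

    Assumes the contributor answers with one "word/tag" token per word,
    in the original order. Returns a list of (word, tag) pairs, or raises
    ValueError if the reply does not match the sentence.
    """
    words = sentence.split()
    tokens = reply.split()
    if len(tokens) != len(words):
        raise ValueError("reply must contain one token per word")

    tagged = []
    for word, token in zip(words, tokens):
        # Split on the last slash so that words containing "/" still work.
        reply_word, sep, tag = token.rpartition("/")
        if not sep or reply_word != word:
            raise ValueError(f"unexpected token: {token!r}")
        tag = tag.lower()
        if tag not in allowed_tags:
            raise ValueError(f"unknown tag: {tag!r}")
        tagged.append((word, tag))
    return tagged
```

Replies that fail validation could be discarded or returned to the contributor with a short hint, so that only well-formed answers count toward the tagged corpus.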

 

Constraints: 

The platform is cloud-based, and hence easy to implement. It also leverages ubiquitous mobile phones and breaks the task down into small, manageable chunks, which encourages participation.

Qualitative Impact: 
The potential impact of this platform is tremendous given the pressing need for quality language resources to empower research and development of language technologies, particularly for underserved languages. The platform will also be generic enough to be easily customizable and usable for any language, multiplying its actual impact.

Comments

Hey Anas,

 

Cool project!

 

How about doing segmentation rather than tagging?

 

Tagging: 

 

 

Example:

Input: J'aimerais bien ca moi, du gateau. 

Output: n v adv dem pro p n

 

* boring

* no one really knows what a noun and verb are

* only English needs to be tagged for noun and verb (minority languages have suffixes which disambiguate for us, including Swahili, Arabic and Urdu)

* tags are fundamentally not a grain of analysis, for neither humans nor machines

* tagging is sooo 1980's

 

Segmentation:

 

Example:

Input: J'aimerais bien ca moi, de la segmentation. 

 

Output: j aim er ais bien ca moi de la segment ation

 

* less boring

* almost all native speakers have an intuition about where suffixes start and end

* a fundamental grain of analysis for both humans and machines

* interesting to see where humans draw the line on segmentation: do they go deep and segment everything, or do they go shallow and segment most or just some?

* need segmentation to bootstrap any unsupervised machine learning algorithm for parsing or extracting meaning

* segmentation is THE fundamental problem behind Searching for Information, hence a major roadblock in these minority languages and getting information to them...

* segmentation is the 2010's

 

 

Someone was mentioning collecting data via Rock Band, which would totally be less boring than via SMS... so you're definitely right about the game idea

 

 

Anyway, rock on!

cesine Dec 04, 2011
