Dakshina Dataset
A collection of text in both Latin and native scripts for 12 South Asian languages.
dataset natural-language-processing languages dakshina paper research code

The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes three resources: a large collection of native-script Wikipedia text, a romanization lexicon that pairs native-script words with their attested romanizations, and a set of full-sentence parallel data written both in a native script of the language and in the basic Latin alphabet.
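To make the romanization lexicon more concrete, here is a minimal Python sketch of how such a resource might be loaded. The description above does not specify the file layout, so the tab-separated format with three columns (native-script word, romanization, attestation count) is an assumption, as are the sample rows and the helper names `load_lexicon` and `best_romanization`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. The three-column, tab-separated layout
# (native word, romanization, attestation count) is an assumption
# made for illustration, not taken from the dataset description.
SAMPLE_LEXICON = (
    "नमस्ते\tnamaste\t3\n"
    "नमस्ते\tnamastey\t1\n"
    "भारत\tbharat\t2\n"
)


def load_lexicon(handle):
    """Map each native-script word to {romanization: attestation count}."""
    lexicon = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        native, latin, count = row
        # Sum counts in case the same pair appears more than once.
        lexicon[native][latin] = lexicon[native].get(latin, 0) + int(count)
    return dict(lexicon)


def best_romanization(lexicon, native_word):
    """Return the most frequently attested romanization, or None if unknown."""
    candidates = lexicon.get(native_word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


lexicon = load_lexicon(io.StringIO(SAMPLE_LEXICON))
best = best_romanization(lexicon, "नमस्ते")
```

Because one native-script word can have several attested romanizations, the sketch keeps all of them with their counts instead of collapsing each word to a single spelling.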


Datasets released by Google Research