Humongous Corpus




OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

OSCAR is currently shuffled at line level and no metadata is provided. Thus, it is mainly intended for training unsupervised language models for NLP.

Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback using the contact form down below. Also consider citing our paper.
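To make the difference between the original and deduplicated forms concrete, here is a minimal sketch of line-level deduplication. It assumes that deduplication means dropping exact repeats of a line; this page does not describe the actual procedure used for OSCAR, so the function name `dedup_lines` and the hashing choice are purely illustrative.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, keeping first-seen order.

    Assumption: two lines are duplicates when they are identical after
    stripping surrounding whitespace. OSCAR's real criterion may differ.
    """
    seen = set()
    for line in lines:
        # Store fixed-size digests rather than the full text of every line.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield line


sample = ["hello world\n", "bonjour\n", "hello world\n"]
unique = list(dedup_lines(sample))
```

Because the function is a generator, it can be chained onto any line iterator. The set of digests still grows with the number of distinct lines, so the largest corpora would likely need a different strategy.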

If you want to contribute to OSCAR, for example by tokenizing one of the corpora for a particular language, or by helping us translate our webpage, please open a pull request here.

The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary.

New: If you need the unshuffled version of OSCAR, please contact us using the contact form down below. Please include your name, affiliation, contact details, the languages you need and a brief description of how you intend to use OSCAR.

Even though OSCAR is not Postcardware, we do appreciate it when our users send us a postcard. If you want to send us one, you can find the address in the contact section down below.


Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

We propose a new pipeline to filter, clean and classify Common Crawl by language. We publish the final corpus under the name OSCAR.

Corpus Download

All the data is distributed by language, and both the original and the deduplicated versions are available. To download a file, click the desired link in the table below. We recommend using pigz to decompress the bigger files in OSCAR.

All sizes are for the uncompressed files. Sizes shown without a unit suffix are in bytes.
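Since the uncompressed files can be very large, it may be preferable to read them as a stream instead of decompressing them to disk first. The sketch below uses Python's standard `gzip` module for this. The helper names `corpus_filename` and `iter_lines` are hypothetical, the file-name pattern is inferred from the table below, and UTF-8 encoding is an assumption.

```python
import gzip
import os
import tempfile
from typing import Iterator


def corpus_filename(lang_code: str, deduplicated: bool = False) -> str:
    """Build a file name following the pattern seen in the table,
    e.g. af.txt.gz and af_dedup.txt.gz."""
    suffix = "_dedup" if deduplicated else ""
    return f"{lang_code}{suffix}.txt.gz"


def iter_lines(path: str) -> Iterator[str]:
    """Yield lines one at a time without loading the whole file."""
    # Assumption: the text is UTF-8 encoded.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demo on a tiny stand-in file, so the sketch runs without any download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, corpus_filename("af"))
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as out:
        out.write("line one\nline two\n")
    demo_lines = list(iter_lines(demo_path))
```

Python's `gzip` module decompresses on a single thread, so for the biggest files a parallel tool such as pigz remains the recommended route.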

Language, Words (original), Size (original), File (original), Words (deduplicated), Size (deduplicated), File (deduplicated)
Afrikaans 43,482,801 241M af.txt.gz 29,533,437 163M af_dedup.txt.gz
Albanian 374,196,110 2.3G sq.txt.gz 186,856,699 1.2G sq_dedup.txt.gz
Amharic 28,301,601 360M am.txt.gz 16,086,628 206M am_dedup.txt.gz
Arabic 8,117,162,828 82G ar.txt.gz 3,171,221,354 32G ar_dedup.txt.gz
Aragonese 52,896 1.3M an.txt.gz 45,669 801K an_dedup.txt.gz
Armenian 273,919,388 3.7G hy.txt.gz 110,196,043 1.5G hy_dedup.txt.gz
Assamese 6,956,663 113M as.txt.gz 4,366,570 71M as_dedup.txt.gz
Asturian 381,005 2.4M ast.txt.gz 325,237 2.0M ast_dedup.txt.gz
Avaric 24,720 409K av.txt.gz 19,478 324K av_dedup.txt.gz
Azerbaijani 322,641,710 2.8G az.txt.gz 167,742,296 1.5G az_dedup.txt.gz
Bashkir 9,796,764 128M ba.txt.gz 6,922,589 90M ba_dedup.txt.gz
Basque 120,456,652 848M eu.txt.gz 45,359,710 342M eu_dedup.txt.gz
Bavarian 399 503 bar.txt.gz 399 503 bar_dedup.txt.gz
Belarusian 144,579,630 1.8G be.txt.gz 83,499,037 1.1G be_dedup.txt.gz
Bengali 623,575,733 11G bn.txt.gz 363,766,143 5.8G bn_dedup.txt.gz
Bihari 8,848 110K bh.txt.gz 2,875 34K bh_dedup.txt.gz
Bishnupriya 198,286 4.1M bpy.txt.gz 96,940 1.7M bpy_dedup.txt.gz
Bosnian 106,448 447K bs.txt.gz 20,485 116K bs_dedup.txt.gz
Breton 5,013,241 29M br.txt.gz 2,890,384 16M br_dedup.txt.gz
Bulgarian 2,947,648,106 32G bg.txt.gz 1,268,114,977 14G bg_dedup.txt.gz
Burmese 56,111,184 1.9G my.txt.gz 30,102,173 1.1G my_dedup.txt.gz
Catalan 1,360,212,450 8.0G ca.txt.gz 729,333,440 4.3G ca_dedup.txt.gz
Cebuano 6,603,567 39M ceb.txt.gz 3,675,024 24M ceb_dedup.txt.gz
Central Bikol 312 885 bcl.txt.gz 312 885 bcl_dedup.txt.gz
Central Khmer 20,690,610 1.1G km.txt.gz 10,082,245 581M km_dedup.txt.gz
Central Kurdish 48,478,334 487M ckb.txt.gz 18,726,721 226M ckb_dedup.txt.gz
Chavacano 130 520 cbk.txt.gz 130 520 cbk_dedup.txt.gz
Chechen 711,051 8.3M ce.txt.gz 568,146 6.7M ce_dedup.txt.gz
Chinese 14,986,424,850 508G zh.txt.gz 6,350,215,113 249G zh_dedup.txt.gz
Chuvash 3,041,614 39M cv.txt.gz 2,054,810 26M cv_dedup.txt.gz
Cornish 8,329 44K kw.txt.gz 2,704 14K kw_dedup.txt.gz
Croatian 34,232,765 226M hr.txt.gz 16,727,640 110M hr_dedup.txt.gz
Czech 7,715,977,441 53G cs.txt.gz 3,540,997,509 24G cs_dedup.txt.gz
Danish 2,637,463,889 16G da.txt.gz 1,620,091,317 9.5G da_dedup.txt.gz
Dhivehi 7,559,472 126M dv.txt.gz 4,726,660 79M dv_dedup.txt.gz
Dimli 19 146 diq.txt.gz 19 146 diq_dedup.txt.gz
Dutch 13,020,136,373 78G nl.txt.gz 6,598,786,137 39G nl_dedup.txt.gz
Eastern Mari 565,992 7.2M mhr.txt.gz 469,297 6.0M mhr_dedup.txt.gz
Egyptian Arabic 7,305,151 66M arz.txt.gz 3,659,419 33M arz_dedup.txt.gz
Emilian-Romagnol 6,376 25K eml.txt.gz 6,121 24K eml_dedup.txt.gz
English 418,187,793,408 2.3T en.txt.gz 215,841,256,971 1.2T en_dedup.txt.gz
Erzya 90 1.4K myv.txt.gz 78 1.2K myv_dedup.txt.gz
Esperanto 48,486,161 299M eo.txt.gz 37,324,446 228M eo_dedup.txt.gz
Estonian 643,163,730 4.8G et.txt.gz 309,931,463 2.3G et_dedup.txt.gz
Finnish 3,196,666,419 27G fi.txt.gz 1,597,855,468 13G fi_dedup.txt.gz
French 46,896,036,417 282G fr.txt.gz 23,206,776,649 138G fr_dedup.txt.gz
Galician 102,011,291 620M gl.txt.gz 63,600,602 384M gl_dedup.txt.gz
Georgian 171,950,621 3.6G ka.txt.gz 91,569,739 1.9G ka_dedup.txt.gz
German 44,878,908,446 308G de.txt.gz 21,529,164,172 145G de_dedup.txt.gz
Goan Konkani 124,277 2.2M gom.txt.gz 102,306 1.8M gom_dedup.txt.gz
Guarani 7,382 36K gn.txt.gz 4,680 24K gn_dedup.txt.gz
Gujarati 72,045,701 1.1G gu.txt.gz 50,023,432 722M gu_dedup.txt.gz
Haitian 1,014 3.9K ht.txt.gz 832 3.3K ht_dedup.txt.gz
Hebrew 2,067,753,528 20G he.txt.gz 1,032,018,056 9.8G he_dedup.txt.gz
Hindi 1,372,234,782 17G hi.txt.gz 745,774,934 8.9G hi_dedup.txt.gz
Hungarian 5,163,936,345 40G hu.txt.gz 2,339,127,555 18G hu_dedup.txt.gz
Icelandic 219,900,094 1.5G is.txt.gz 129,818,331 846M is_dedup.txt.gz
Ido 25,702 147K io.txt.gz 22,773 130K io_dedup.txt.gz
Iloko 142,942 874K ilo.txt.gz 105,564 636K ilo_dedup.txt.gz
Indonesian 4,574,692,265 30G id.txt.gz 2,394,957,629 16G id_dedup.txt.gz
Interlingua 180,231 662K ia.txt.gz 100,019 360K ia_dedup.txt.gz
Interlingue 5,352 24K ie.txt.gz 602 1.6K ie_dedup.txt.gz
Irish 14,483,593 88M ga.txt.gz 10,017,303 60M ga_dedup.txt.gz
Italian 22,248,707,341 137G it.txt.gz 11,250,012,896 69G it_dedup.txt.gz
Japanese 4,962,979,182 216G ja.txt.gz 1,123,067,063 106G ja_dedup.txt.gz
Javanese 104,896 659K jv.txt.gz 86,654 583K jv_dedup.txt.gz
Kalmyk 10,277 113K xal.txt.gz 10,155 112K xal_dedup.txt.gz
Kannada 81,186,863 1.7G kn.txt.gz 49,343,462 1.1G kn_dedup.txt.gz
Karachay-Balkar 185,436 2.6M krc.txt.gz 166,496 2.3M krc_dedup.txt.gz
Kazakh 191,126,469 2.7G kk.txt.gz 108,388,743 1.5G kk_dedup.txt.gz
Kirghiz 44,194,823 600M ky.txt.gz 28,982,620 388M ky_dedup.txt.gz
Komi 201,404 2.3M kv.txt.gz 95,243 1.2M kv_dedup.txt.gz
Korean 2,368,765,142 24G ko.txt.gz 1,120,375,149 12G ko_dedup.txt.gz
Kurdish 15,561,003 94M ku.txt.gz 9,946,440 60M ku_dedup.txt.gz
Lao 4,133,311 174M lo.txt.gz 2,583,342 114M lo_dedup.txt.gz
Latin 4,122,201 26M la.txt.gz 1,328,038 8.3M la_dedup.txt.gz
Latvian 520,761,977 4.0G lv.txt.gz 236,428,905 1.8G lv_dedup.txt.gz
Lezghian 247,646 3.3M lez.txt.gz 224,871 3.0M lez_dedup.txt.gz
Limburgan 4,730 29K li.txt.gz 4,283 27K li_dedup.txt.gz
Lithuanian 1,159,661,742 8.8G lt.txt.gz 516,183,525 3.9G lt_dedup.txt.gz
Lojban 154,330 736K jbo.txt.gz 141,973 678K jbo_dedup.txt.gz
Lombard 75,229 443K lmo.txt.gz 73,665 433K lmo_dedup.txt.gz
Low German 2,906,347 18M nds.txt.gz 2,146,417 13M nds_dedup.txt.gz
Lower Sorbian 1,787 13K dsb.txt.gz 966 7.1K dsb_dedup.txt.gz
Luxembourgish 4,403,577 29M lb.txt.gz 3,087,650 21M lb_dedup.txt.gz
Macedonian 189,289,873 2.1G mk.txt.gz 102,849,595 1.2G mk_dedup.txt.gz
Maithili 69,161 317K mai.txt.gz 874 11K mai_dedup.txt.gz
Malagasy 3,068,360 21M mg.txt.gz 1,872,044 13M mg_dedup.txt.gz
Malay 16,696,882 111M ms.txt.gz 6,045,753 42M ms_dedup.txt.gz
Malayalam 189,534,472 4.9G ml.txt.gz 95,892,551 2.5G ml_dedup.txt.gz
Maltese 2,995,654 24M mt.txt.gz 2,163,358 17M mt_dedup.txt.gz
Marathi 162,609,404 2.7G mr.txt.gz 82,130,803 1.4G mr_dedup.txt.gz
Mazanderani 73,870 691K mzn.txt.gz 64,481 602K mzn_dedup.txt.gz
Minangkabau 5,682 608K min.txt.gz 4,825 310K min_dedup.txt.gz
Mingrelian 299,098 5.8M xmf.txt.gz 228,629 4.4M xmf_dedup.txt.gz
Mirandese 171 1.2K mwl.txt.gz 152 1.1K mwl_dedup.txt.gz
Modern Greek 5,479,180,137 62G el.txt.gz 2,412,419,435 27G el_dedup.txt.gz
Mongolian 181,307,167 2.2G mn.txt.gz 68,362,013 838M mn_dedup.txt.gz
Nahuatl languages 1,234 12K nah.txt.gz 1,193 11K nah_dedup.txt.gz
Neapolitan 5,282 17K nap.txt.gz 4,147 13K nap_dedup.txt.gz
Nepali 107,448,208 1.8G ne.txt.gz 71,628,317 1.2G ne_dedup.txt.gz
Newari 564,697 5.5M new.txt.gz 288,995 4.1M new_dedup.txt.gz
Northern Frisian 1,516 4.4K frr.txt.gz 1,516 4.4K frr_dedup.txt.gz
Northern Luri 8,022 76K lrc.txt.gz 6,740 63K lrc_dedup.txt.gz
Norwegian 1,344,326,388 8.0G no.txt.gz 804,894,377 4.7G no_dedup.txt.gz
Norwegian Nynorsk 14,764,980 85M nn.txt.gz 9,435,139 54M nn_dedup.txt.gz
Occitan 750,301 5.8M oc.txt.gz 512,678 3.7M oc_dedup.txt.gz
Oriya 14,938,567 248M or.txt.gz 11,321,740 188M or_dedup.txt.gz
Ossetian 1,031,268 13M os.txt.gz 878,765 11M os_dedup.txt.gz
Pampanga 130 760 pam.txt.gz 52 304 pam_dedup.txt.gz
Panjabi 61,847,806 763M pa.txt.gz 37,555,835 460M pa_dedup.txt.gz
Persian 9,096,554,121 79G fa.txt.gz 4,363,505,319 38G fa_dedup.txt.gz
Piemontese 362,013 2.1M pms.txt.gz 337,246 1.9M pms_dedup.txt.gz
Polish 15,277,255,137 109G pl.txt.gz 6,708,709,674 47G pl_dedup.txt.gz
Portuguese 20,641,903,898 124G pt.txt.gz 10,751,156,918 64G pt_dedup.txt.gz
Pushto 46,559,441 361M ps.txt.gz 31,347,348 242M ps_dedup.txt.gz
Quechua 10,186 78K qu.txt.gz 8,691 67K qu_dedup.txt.gz
Romanian 3,984,317,058 25G ro.txt.gz 1,741,794,069 11G ro_dedup.txt.gz
Romansh 1,093 7.4K rm.txt.gz 960 6.5K rm_dedup.txt.gz
Russia Buriat 963 13K bxr.txt.gz 809 11K bxr_dedup.txt.gz
Russian 92,522,407,837 1.2T ru.txt.gz 46,692,691,520 568G ru_dedup.txt.gz
Sanskrit 4,331,569 93M sa.txt.gz 1,713,930 37M sa_dedup.txt.gz
Scottish Gaelic 310,689 1.9M gd.txt.gz 207,110 1.3M gd_dedup.txt.gz
Serbian 364,395,411 3.9G sr.txt.gz 207,561,168 2.2G sr_dedup.txt.gz
Serbo-Croatian 5,292,184 25M sh.txt.gz 1,040,573 5.8M sh_dedup.txt.gz
Sicilian 554 3.3K scn.txt.gz 468 2.8K scn_dedup.txt.gz
Sindhi 43,530,158 347M sd.txt.gz 33,028,015 263M sd_dedup.txt.gz
Sinhala 93,053,465 1.4G si.txt.gz 50,864,857 802M si_dedup.txt.gz
Slovak 1,322,247,763 9.1G sk.txt.gz 656,346,179 4.5G sk_dedup.txt.gz
Slovenian 387,399,700 2.5G sl.txt.gz 193,926,684 1.3G sl_dedup.txt.gz
Somali 1,202 61K so.txt.gz 472 16K so_dedup.txt.gz
South Azerbaijani 2,175,054 27M azb.txt.gz 1,528,709 19M azb_dedup.txt.gz
Spanish 47,545,122,279 278G es.txt.gz 25,928,290,729 149G es_dedup.txt.gz
Sundanese 30,321 211K su.txt.gz 20,278 141K su_dedup.txt.gz
Swahili 2,211,927 13M sw.txt.gz 1,376,963 8.1M sw_dedup.txt.gz
Swedish 7,155,994,312 44G sv.txt.gz 4,106,120,608 25G sv_dedup.txt.gz
Tagalog 98,949,299 573M tl.txt.gz 70,121,601 407M tl_dedup.txt.gz
Tajik 31,758,142 379M tg.txt.gz 21,029,893 249M tg_dedup.txt.gz
Tamil 420,537,132 9.3G ta.txt.gz 226,013,330 5.1G ta_dedup.txt.gz
Tatar 51,034,893 670M tt.txt.gz 23,825,695 305M tt_dedup.txt.gz
Telugu 123,711,517 2.5G te.txt.gz 79,094,167 1.6G te_dedup.txt.gz
Thai 951,743,087 36G th.txt.gz 368,965,202 16G th_dedup.txt.gz
Tibetan 1,483,589 187M bo.txt.gz 936,556 138M bo_dedup.txt.gz
Tosk Albanian 841,750 5.0M als.txt.gz 459,001 2.8M als_dedup.txt.gz
Turkish 7,577,388,700 60G tr.txt.gz 3,365,734,289 27G tr_dedup.txt.gz
Turkmen 1,113,869 11M tk.txt.gz 752,326 6.8M tk_dedup.txt.gz
Tuvinian 759 12K tyv.txt.gz 540 7.9K tyv_dedup.txt.gz
Uighur 8,657,141 122M ug.txt.gz 5,852,225 83M ug_dedup.txt.gz
Ukrainian 4,204,381,276 53G uk.txt.gz 2,252,380,351 28G uk_dedup.txt.gz
Upper Sorbian 545,351 4.2M hsb.txt.gz 236,867 1.8M hsb_dedup.txt.gz
Urdu 331,817,982 2.7G ur.txt.gz 218,030,228 1.7G ur_dedup.txt.gz
Uzbek 2,450,256 21M uz.txt.gz 1,381,644 12M uz_dedup.txt.gz
Venetian 3,492 18K vec.txt.gz 3,199 17K vec_dedup.txt.gz
Vietnamese 12,036,845,359 68G vi.txt.gz 5,577,159,843 32G vi_dedup.txt.gz
Volapük 321,121 2.0M vo.txt.gz 318,568 2.0M vo_dedup.txt.gz
Walloon 50,720 273K wa.txt.gz 37,543 203K wa_dedup.txt.gz
Waray 397,315 2.5M war.txt.gz 336,311 2.2M war_dedup.txt.gz
Welsh 37,422,441 213M cy.txt.gz 23,574,673 133M cy_dedup.txt.gz
Western Frisian 5,691,077 35M fy.txt.gz 4,223,816 26M fy_dedup.txt.gz
Western Mari 93,338 1.2M mrj.txt.gz 87,780 1.1M mrj_dedup.txt.gz
Western Panjabi 1,426,986 12M pnb.txt.gz 1,111,112 9.0M pnb_dedup.txt.gz
Wu Chinese 11,189 109K wuu.txt.gz 4,333 32K wuu_dedup.txt.gz
Yakut 2,547,623 42M sah.txt.gz 1,789,174 26M sah_dedup.txt.gz
Yiddish 13,834,320 141M yi.txt.gz 8,212,970 84M yi_dedup.txt.gz
Yoruba 8,906 55K yo.txt.gz 3,518 27K yo_dedup.txt.gz
Yue Chinese 186 3.7K yue.txt.gz 128 2.2K yue_dedup.txt.gz
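The rows above can be processed programmatically, for example to see what fraction of the words survives deduplication. The following is a minimal sketch that parses rows in the layout shown in the table. Treating the K/M/G/T suffixes as powers of 1024 is an assumption, and the `Row`, `parse_size` and `parse_row` names are illustrative only.

```python
from typing import NamedTuple

# Assumption: size suffixes are binary multiples (powers of 1024).
UNIT = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


class Row(NamedTuple):
    language: str
    words_original: int
    size_original: float  # bytes
    file_original: str
    words_dedup: int
    size_dedup: float  # bytes
    file_dedup: str


def parse_size(text: str) -> float:
    """Convert '241M' or '503' (no suffix, plain bytes) to bytes."""
    if text[-1] in UNIT:
        return float(text[:-1]) * UNIT[text[-1]]
    return float(text)


def parse_row(line: str) -> Row:
    # Language names may contain spaces ("Central Bikol"), so the six
    # data fields are split off from the right-hand side.
    name, w_o, s_o, f_o, w_d, s_d, f_d = line.rsplit(maxsplit=6)
    return Row(
        name,
        int(w_o.replace(",", "")), parse_size(s_o), f_o,
        int(w_d.replace(",", "")), parse_size(s_d), f_d,
    )


rows = [parse_row(r) for r in (
    "Afrikaans 43,482,801 241M af.txt.gz 29,533,437 163M af_dedup.txt.gz",
    "Central Bikol 312 885 bcl.txt.gz 312 885 bcl_dedup.txt.gz",
)]

# Fraction of words kept after deduplication, per language.
kept = {r.language: r.words_dedup / r.words_original for r in rows}
```

For the two sample rows, about 68% of the Afrikaans words remain after deduplication, while the Central Bikol file is unchanged.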


These data are released under this licensing scheme:

  • We do not own any of the text from which these data have been extracted.
  • We license the actual packaging of these data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
  • This work is published from: France.


Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • Send us this information using the contact form below.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.