OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
OSCAR is currently shuffled at line level and no metadata is provided, so it is mainly intended for training unsupervised language models for NLP.
Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback using the contact form below, and consider citing our paper.
If you want to contribute to OSCAR, for example by tokenizing one of the corpora for a particular language, or by helping us translate our webpage, please open a pull request here.
The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary.
All the data is distributed by language, and both the original and the deduplicated versions are available. To download a file, click the desired link in the table below. We recommend using pigz to decompress the larger files in OSCAR.
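A downloaded file can also be streamed line by line with Python's standard `gzip` module, without decompressing it to disk first. The following is a minimal sketch and not part of the OSCAR tooling: the function name `count_lines_and_words` and the temporary sample file are hypothetical, and splitting on whitespace is only an assumed approximation of how the word counts in the table were obtained.

```python
import gzip
import os
import tempfile


def count_lines_and_words(path):
    """Stream a gzip-compressed UTF-8 text file; return (lines, words)."""
    lines = 0
    words = 0
    # "rt" decodes to text on the fly, so the file is never fully loaded into memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # whitespace tokenization (an assumption)
    return lines, words


if __name__ == "__main__":
    # Build a tiny stand-in file; this is NOT real OSCAR data.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.txt.gz")
        with gzip.open(sample, "wt", encoding="utf-8") as handle:
            handle.write("first line of text\nsecond line\n")
        print(count_lines_and_words(sample))  # (2, 6)
```

For a real file, pass the path of a downloaded archive such as `af.txt.gz` instead of the sample. Single-threaded streaming like this is likely to be slow on the largest files, where decompressing with pigz first may be the better option.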
All sizes are for the uncompressed files.
| Language | Words original | Size original | File original | Words deduplicated | Size deduplicated | File deduplicated |
|---|---|---|---|---|---|---|
| Afrikaans | 43,482,801 | 241M | af.txt.gz | 29,533,437 | 163M | af_dedup.txt.gz |
| Albanian | 374,196,110 | 2.3G | sq.txt.gz | 186,856,699 | 1.2G | sq_dedup.txt.gz |
| Amharic | 28,301,601 | 360M | am.txt.gz | 16,086,628 | 206M | am_dedup.txt.gz |
| Arabic | 8,117,162,828 | 82G | ar.txt.gz | 3,171,221,354 | 32G | ar_dedup.txt.gz |
| Aragonese | 52,896 | 1.3M | an.txt.gz | 45,669 | 801K | an_dedup.txt.gz |
| Armenian | 273,919,388 | 3.7G | hy.txt.gz | 110,196,043 | 1.5G | hy_dedup.txt.gz |
| Assamese | 6,956,663 | 113M | as.txt.gz | 4,366,570 | 71M | as_dedup.txt.gz |
| Asturian | 381,005 | 2.4M | ast.txt.gz | 325,237 | 2.0M | ast_dedup.txt.gz |
| Avaric | 24,720 | 409K | av.txt.gz | 19,478 | 324K | av_dedup.txt.gz |
| Azerbaijani | 322,641,710 | 2.8G | az.txt.gz | 167,742,296 | 1.5G | az_dedup.txt.gz |
| Bashkir | 9,796,764 | 128M | ba.txt.gz | 6,922,589 | 90M | ba_dedup.txt.gz |
| Basque | 120,456,652 | 848M | eu.txt.gz | 45,359,710 | 342M | eu_dedup.txt.gz |
| Bavarian | 399 | 503 | bar.txt.gz | 399 | 503 | bar_dedup.txt.gz |
| Belarusian | 144,579,630 | 1.8G | be.txt.gz | 83,499,037 | 1.1G | be_dedup.txt.gz |
| Bengali | 623,575,733 | 11G | bn.txt.gz | 363,766,143 | 5.8G | bn_dedup.txt.gz |
| Bihari | 8,848 | 110K | bh.txt.gz | 2,875 | 34K | bh_dedup.txt.gz |
| Bishnupriya | 198,286 | 4.1M | bpy.txt.gz | 96,940 | 1.7M | bpy_dedup.txt.gz |
| Bosnian | 106,448 | 447K | bs.txt.gz | 20,485 | 116K | bs_dedup.txt.gz |
| Breton | 5,013,241 | 29M | br.txt.gz | 2,890,384 | 16M | br_dedup.txt.gz |
| Bulgarian | 2,947,648,106 | 32G | bg.txt.gz | 1,268,114,977 | 14G | bg_dedup.txt.gz |
| Burmese | 56,111,184 | 1.9G | my.txt.gz | 30,102,173 | 1.1G | my_dedup.txt.gz |
| Catalan | 1,360,212,450 | 8.0G | ca.txt.gz | 729,333,440 | 4.3G | ca_dedup.txt.gz |
| Cebuano | 6,603,567 | 39M | ceb.txt.gz | 3,675,024 | 24M | ceb_dedup.txt.gz |
| Central Bikol | 312 | 885 | bcl.txt.gz | 312 | 885 | bcl_dedup.txt.gz |
| Central Khmer | 20,690,610 | 1.1G | km.txt.gz | 10,082,245 | 581M | km_dedup.txt.gz |
| Central Kurdish | 48,478,334 | 487M | ckb.txt.gz | 18,726,721 | 226M | ckb_dedup.txt.gz |
| Chavacano | 130 | 520 | cbk.txt.gz | 130 | 520 | cbk_dedup.txt.gz |
| Chechen | 711,051 | 8.3M | ce.txt.gz | 568,146 | 6.7M | ce_dedup.txt.gz |
| Chinese | 14,986,424,850 | 508G | zh.txt.gz | 6,350,215,113 | 249G | zh_dedup.txt.gz |
| Chuvash | 3,041,614 | 39M | cv.txt.gz | 2,054,810 | 26M | cv_dedup.txt.gz |
| Cornish | 8,329 | 44K | kw.txt.gz | 2,704 | 14K | kw_dedup.txt.gz |
| Croatian | 34,232,765 | 226M | hr.txt.gz | 16,727,640 | 110M | hr_dedup.txt.gz |
| Czech | 7,715,977,441 | 53G | cs.txt.gz | 3,540,997,509 | 24G | cs_dedup.txt.gz |
| Danish | 2,637,463,889 | 16G | da.txt.gz | 1,620,091,317 | 9.5G | da_dedup.txt.gz |
| Dhivehi | 7,559,472 | 126M | dv.txt.gz | 4,726,660 | 79M | dv_dedup.txt.gz |
| Dimli | 19 | 146 | diq.txt.gz | 19 | 146 | diq_dedup.txt.gz |
| Dutch | 13,020,136,373 | 78G | nl.txt.gz | 6,598,786,137 | 39G | nl_dedup.txt.gz |
| Eastern Mari | 565,992 | 7.2M | mhr.txt.gz | 469,297 | 6.0M | mhr_dedup.txt.gz |
| Egyptian Arabic | 7,305,151 | 66M | arz.txt.gz | 3,659,419 | 33M | arz_dedup.txt.gz |
| Emilian-Romagnol | 6,376 | 25K | eml.txt.gz | 6,121 | 24K | eml_dedup.txt.gz |
| English | 418,187,793,408 | 2.3T | en.txt.gz | 215,841,256,971 | 1.2T | en_dedup.txt.gz |
| Erzya | 90 | 1.4K | myv.txt.gz | 78 | 1.2K | myv_dedup.txt.gz |
| Esperanto | 48,486,161 | 299M | eo.txt.gz | 37,324,446 | 228M | eo_dedup.txt.gz |
| Estonian | 643,163,730 | 4.8G | et.txt.gz | 309,931,463 | 2.3G | et_dedup.txt.gz |
| Finnish | 3,196,666,419 | 27G | fi.txt.gz | 1,597,855,468 | 13G | fi_dedup.txt.gz |
| French | 46,896,036,417 | 282G | fr.txt.gz | 23,206,776,649 | 138G | fr_dedup.txt.gz |
| Galician | 102,011,291 | 620M | gl.txt.gz | 63,600,602 | 384M | gl_dedup.txt.gz |
| Georgian | 171,950,621 | 3.6G | ka.txt.gz | 91,569,739 | 1.9G | ka_dedup.txt.gz |
| German | 44,878,908,446 | 308G | de.txt.gz | 21,529,164,172 | 145G | de_dedup.txt.gz |
| Goan Konkani | 124,277 | 2.2M | gom.txt.gz | 102,306 | 1.8M | gom_dedup.txt.gz |
| Guarani | 7,382 | 36K | gn.txt.gz | 4,680 | 24K | gn_dedup.txt.gz |
| Gujarati | 72,045,701 | 1.1G | gu.txt.gz | 50,023,432 | 722M | gu_dedup.txt.gz |
| Haitian | 1,014 | 3.9K | ht.txt.gz | 832 | 3.3K | ht_dedup.txt.gz |
| Hebrew | 2,067,753,528 | 20G | he.txt.gz | 1,032,018,056 | 9.8G | he_dedup.txt.gz |
| Hindi | 1,372,234,782 | 17G | hi.txt.gz | 745,774,934 | 8.9G | hi_dedup.txt.gz |
| Hungarian | 5,163,936,345 | 40G | hu.txt.gz | 2,339,127,555 | 18G | hu_dedup.txt.gz |
| Icelandic | 219,900,094 | 1.5G | is.txt.gz | 129,818,331 | 846M | is_dedup.txt.gz |
| Ido | 25,702 | 147K | io.txt.gz | 22,773 | 130K | io_dedup.txt.gz |
| Iloko | 142,942 | 874K | ilo.txt.gz | 105,564 | 636K | ilo_dedup.txt.gz |
| Indonesian | 4,574,692,265 | 30G | id.txt.gz | 2,394,957,629 | 16G | id_dedup.txt.gz |
| Interlingua | 180,231 | 662K | ia.txt.gz | 100,019 | 360K | ia_dedup.txt.gz |
| Interlingue | 5,352 | 24K | ie.txt.gz | 602 | 1.6K | ie_dedup.txt.gz |
| Irish | 14,483,593 | 88M | ga.txt.gz | 10,017,303 | 60M | ga_dedup.txt.gz |
| Italian | 22,248,707,341 | 137G | it.txt.gz | 11,250,012,896 | 69G | it_dedup.txt.gz |
| Japanese | 4,962,979,182 | 216G | ja.txt.gz | 1,123,067,063 | 106G | ja_dedup.txt.gz |
| Javanese | 104,896 | 659K | jv.txt.gz | 86,654 | 583K | jv_dedup.txt.gz |
| Kalmyk | 10,277 | 113K | xal.txt.gz | 10,155 | 112K | xal_dedup.txt.gz |
| Kannada | 81,186,863 | 1.7G | kn.txt.gz | 49,343,462 | 1.1G | kn_dedup.txt.gz |
| Karachay-Balkar | 185,436 | 2.6M | krc.txt.gz | 166,496 | 2.3M | krc_dedup.txt.gz |
| Kazakh | 191,126,469 | 2.7G | kk.txt.gz | 108,388,743 | 1.5G | kk_dedup.txt.gz |
| Kirghiz | 44,194,823 | 600M | ky.txt.gz | 28,982,620 | 388M | ky_dedup.txt.gz |
| Komi | 201,404 | 2.3M | kv.txt.gz | 95,243 | 1.2M | kv_dedup.txt.gz |
| Korean | 2,368,765,142 | 24G | ko.txt.gz | 1,120,375,149 | 12G | ko_dedup.txt.gz |
| Kurdish | 15,561,003 | 94M | ku.txt.gz | 9,946,440 | 60M | ku_dedup.txt.gz |
| Lao | 4,133,311 | 174M | lo.txt.gz | 2,583,342 | 114M | lo_dedup.txt.gz |
| Latin | 4,122,201 | 26M | la.txt.gz | 1,328,038 | 8.3M | la_dedup.txt.gz |
| Latvian | 520,761,977 | 4.0G | lv.txt.gz | 236,428,905 | 1.8G | lv_dedup.txt.gz |
| Lezghian | 247,646 | 3.3M | lez.txt.gz | 224,871 | 3.0M | lez_dedup.txt.gz |
| Limburgan | 4,730 | 29K | li.txt.gz | 4,283 | 27K | li_dedup.txt.gz |
| Lithuanian | 1,159,661,742 | 8.8G | lt.txt.gz | 516,183,525 | 3.9G | lt_dedup.txt.gz |
| Lojban | 154,330 | 736K | jbo.txt.gz | 141,973 | 678K | jbo_dedup.txt.gz |
| Lombard | 75,229 | 443K | lmo.txt.gz | 73,665 | 433K | lmo_dedup.txt.gz |
| Low German | 2,906,347 | 18M | nds.txt.gz | 2,146,417 | 13M | nds_dedup.txt.gz |
| Lower Sorbian | 1,787 | 13K | dsb.txt.gz | 966 | 7.1K | dsb_dedup.txt.gz |
| Luxembourgish | 4,403,577 | 29M | lb.txt.gz | 3,087,650 | 21M | lb_dedup.txt.gz |
| Macedonian | 189,289,873 | 2.1G | mk.txt.gz | 102,849,595 | 1.2G | mk_dedup.txt.gz |
| Maithili | 69,161 | 317K | mai.txt.gz | 874 | 11K | mai_dedup.txt.gz |
| Malagasy | 3,068,360 | 21M | mg.txt.gz | 1,872,044 | 13M | mg_dedup.txt.gz |
| Malay | 16,696,882 | 111M | ms.txt.gz | 6,045,753 | 42M | ms_dedup.txt.gz |
| Malayalam | 189,534,472 | 4.9G | ml.txt.gz | 95,892,551 | 2.5G | ml_dedup.txt.gz |
| Maltese | 2,995,654 | 24M | mt.txt.gz | 2,163,358 | 17M | mt_dedup.txt.gz |
| Marathi | 162,609,404 | 2.7G | mr.txt.gz | 82,130,803 | 1.4G | mr_dedup.txt.gz |
| Mazanderani | 73,870 | 691K | mzn.txt.gz | 64,481 | 602K | mzn_dedup.txt.gz |
| Minangkabau | 5,682 | 608K | min.txt.gz | 4,825 | 310K | min_dedup.txt.gz |
| Mingrelian | 299,098 | 5.8M | xmf.txt.gz | 228,629 | 4.4M | xmf_dedup.txt.gz |
| Mirandese | 171 | 1.2K | mwl.txt.gz | 152 | 1.1K | mwl_dedup.txt.gz |
| Modern Greek | 5,479,180,137 | 62G | el.txt.gz | 2,412,419,435 | 27G | el_dedup.txt.gz |
| Mongolian | 181,307,167 | 2.2G | mn.txt.gz | 68,362,013 | 838M | mn_dedup.txt.gz |
| Nahuatl languages | 1,234 | 12K | nah.txt.gz | 1,193 | 11K | nah_dedup.txt.gz |
| Neapolitan | 5,282 | 17K | nap.txt.gz | 4,147 | 13K | nap_dedup.txt.gz |
| Nepali | 107,448,208 | 1.8G | ne.txt.gz | 71,628,317 | 1.2G | ne_dedup.txt.gz |
| Newari | 564,697 | 5.5M | new.txt.gz | 288,995 | 4.1M | new_dedup.txt.gz |
| Northern Frisian | 1,516 | 4.4K | frr.txt.gz | 1,516 | 4.4K | frr_dedup.txt.gz |
| Northern Luri | 8,022 | 76K | lrc.txt.gz | 6,740 | 63K | lrc_dedup.txt.gz |
| Norwegian | 1,344,326,388 | 8.0G | no.txt.gz | 804,894,377 | 4.7G | no_dedup.txt.gz |
| Norwegian Nynorsk | 14,764,980 | 85M | nn.txt.gz | 9,435,139 | 54M | nn_dedup.txt.gz |
| Occitan | 750,301 | 5.8M | oc.txt.gz | 512,678 | 3.7M | oc_dedup.txt.gz |
| Oriya | 14,938,567 | 248M | or.txt.gz | 11,321,740 | 188M | or_dedup.txt.gz |
| Ossetian | 1,031,268 | 13M | os.txt.gz | 878,765 | 11M | os_dedup.txt.gz |
| Pampanga | 130 | 760 | pam.txt.gz | 52 | 304 | pam_dedup.txt.gz |
| Panjabi | 61,847,806 | 763M | pa.txt.gz | 37,555,835 | 460M | pa_dedup.txt.gz |
| Persian | 9,096,554,121 | 79G | fa.txt.gz | 4,363,505,319 | 38G | fa_dedup.txt.gz |
| Piemontese | 362,013 | 2.1M | pms.txt.gz | 337,246 | 1.9M | pms_dedup.txt.gz |
| Polish | 15,277,255,137 | 109G | pl.txt.gz | 6,708,709,674 | 47G | pl_dedup.txt.gz |
| Portuguese | 20,641,903,898 | 124G | pt.txt.gz | 10,751,156,918 | 64G | pt_dedup.txt.gz |
| Pushto | 46,559,441 | 361M | ps.txt.gz | 31,347,348 | 242M | ps_dedup.txt.gz |
| Quechua | 10,186 | 78K | qu.txt.gz | 8,691 | 67K | qu_dedup.txt.gz |
| Romanian | 3,984,317,058 | 25G | ro.txt.gz | 1,741,794,069 | 11G | ro_dedup.txt.gz |
| Romansh | 1,093 | 7.4K | rm.txt.gz | 960 | 6.5K | rm_dedup.txt.gz |
| Russia Buriat | 963 | 13K | bxr.txt.gz | 809 | 11K | bxr_dedup.txt.gz |
| Russian | 92,522,407,837 | 1.2T | ru.txt.gz | 46,692,691,520 | 568G | ru_dedup.txt.gz |
| Sanskrit | 4,331,569 | 93M | sa.txt.gz | 1,713,930 | 37M | sa_dedup.txt.gz |
| Scottish Gaelic | 310,689 | 1.9M | gd.txt.gz | 207,110 | 1.3M | gd_dedup.txt.gz |
| Serbian | 364,395,411 | 3.9G | sr.txt.gz | 207,561,168 | 2.2G | sr_dedup.txt.gz |
| Serbo-Croatian | 5,292,184 | 25M | sh.txt.gz | 1,040,573 | 5.8M | sh_dedup.txt.gz |
| Sicilian | 554 | 3.3K | scn.txt.gz | 468 | 2.8K | scn_dedup.txt.gz |
| Sindhi | 43,530,158 | 347M | sd.txt.gz | 33,028,015 | 263M | sd_dedup.txt.gz |
| Sinhala | 93,053,465 | 1.4G | si.txt.gz | 50,864,857 | 802M | si_dedup.txt.gz |
| Slovak | 1,322,247,763 | 9.1G | sk.txt.gz | 656,346,179 | 4.5G | sk_dedup.txt.gz |
| Slovenian | 387,399,700 | 2.5G | sl.txt.gz | 193,926,684 | 1.3G | sl_dedup.txt.gz |
| Somali | 1,202 | 61K | so.txt.gz | 472 | 16K | so_dedup.txt.gz |
| South Azerbaijani | 2,175,054 | 27M | azb.txt.gz | 1,528,709 | 19M | azb_dedup.txt.gz |
| Spanish | 47,545,122,279 | 278G | es.txt.gz | 25,928,290,729 | 149G | es_dedup.txt.gz |
| Sundanese | 30,321 | 211K | su.txt.gz | 20,278 | 141K | su_dedup.txt.gz |
| Swahili | 2,211,927 | 13M | sw.txt.gz | 1,376,963 | 8.1M | sw_dedup.txt.gz |
| Swedish | 7,155,994,312 | 44G | sv.txt.gz | 4,106,120,608 | 25G | sv_dedup.txt.gz |
| Tagalog | 98,949,299 | 573M | tl.txt.gz | 70,121,601 | 407M | tl_dedup.txt.gz |
| Tajik | 31,758,142 | 379M | tg.txt.gz | 21,029,893 | 249M | tg_dedup.txt.gz |
| Tamil | 420,537,132 | 9.3G | ta.txt.gz | 226,013,330 | 5.1G | ta_dedup.txt.gz |
| Tatar | 51,034,893 | 670M | tt.txt.gz | 23,825,695 | 305M | tt_dedup.txt.gz |
| Telugu | 123,711,517 | 2.5G | te.txt.gz | 79,094,167 | 1.6G | te_dedup.txt.gz |
| Thai | 951,743,087 | 36G | th.txt.gz | 368,965,202 | 16G | th_dedup.txt.gz |
| Tibetan | 1,483,589 | 187M | bo.txt.gz | 936,556 | 138M | bo_dedup.txt.gz |
| Tosk Albanian | 841,750 | 5.0M | als.txt.gz | 459,001 | 2.8M | als_dedup.txt.gz |
| Turkish | 7,577,388,700 | 60G | tr.txt.gz | 3,365,734,289 | 27G | tr_dedup.txt.gz |
| Turkmen | 1,113,869 | 11M | tk.txt.gz | 752,326 | 6.8M | tk_dedup.txt.gz |
| Tuvinian | 759 | 12K | tyv.txt.gz | 540 | 7.9K | tyv_dedup.txt.gz |
| Uighur | 8,657,141 | 122M | ug.txt.gz | 5,852,225 | 83M | ug_dedup.txt.gz |
| Ukrainian | 4,204,381,276 | 53G | uk.txt.gz | 2,252,380,351 | 28G | uk_dedup.txt.gz |
| Upper Sorbian | 545,351 | 4.2M | hsb.txt.gz | 236,867 | 1.8M | hsb_dedup.txt.gz |
| Urdu | 331,817,982 | 2.7G | ur.txt.gz | 218,030,228 | 1.7G | ur_dedup.txt.gz |
| Uzbek | 2,450,256 | 21M | uz.txt.gz | 1,381,644 | 12M | uz_dedup.txt.gz |
| Venetian | 3,492 | 18K | vec.txt.gz | 3,199 | 17K | vec_dedup.txt.gz |
| Vietnamese | 12,036,845,359 | 68G | vi.txt.gz | 5,577,159,843 | 32G | vi_dedup.txt.gz |
| Volapük | 321,121 | 2.0M | vo.txt.gz | 318,568 | 2.0M | vo_dedup.txt.gz |
| Walloon | 50,720 | 273K | wa.txt.gz | 37,543 | 203K | wa_dedup.txt.gz |
| Waray | 397,315 | 2.5M | war.txt.gz | 336,311 | 2.2M | war_dedup.txt.gz |
| Welsh | 37,422,441 | 213M | cy.txt.gz | 23,574,673 | 133M | cy_dedup.txt.gz |
| Western Frisian | 5,691,077 | 35M | fy.txt.gz | 4,223,816 | 26M | fy_dedup.txt.gz |
| Western Mari | 93,338 | 1.2M | mrj.txt.gz | 87,780 | 1.1M | mrj_dedup.txt.gz |
| Western Panjabi | 1,426,986 | 12M | pnb.txt.gz | 1,111,112 | 9.0M | pnb_dedup.txt.gz |
| Wu Chinese | 11,189 | 109K | wuu.txt.gz | 4,333 | 32K | wuu_dedup.txt.gz |
| Yakut | 2,547,623 | 42M | sah.txt.gz | 1,789,174 | 26M | sah_dedup.txt.gz |
| Yiddish | 13,834,320 | 141M | yi.txt.gz | 8,212,970 | 84M | yi_dedup.txt.gz |
| Yoruba | 8,906 | 55K | yo.txt.gz | 3,518 | 27K | yo_dedup.txt.gz |
| Yue Chinese | 186 | 3.7K | yue.txt.gz | 128 | 2.2K | yue_dedup.txt.gz |
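
The word counts in the table make it possible to estimate how much of each corpus survives deduplication. The sketch below computes the retained fraction for three rows copied from the table above; the helper names `parse_count` and `retained_fraction` are hypothetical and only illustrate the arithmetic.

```python
def parse_count(text):
    """Convert a comma-grouped count such as '43,482,801' to an int."""
    return int(text.replace(",", ""))


def retained_fraction(original, deduplicated):
    """Fraction of words kept after deduplication (1.0 means nothing was removed)."""
    return parse_count(deduplicated) / parse_count(original)


# (words original, words deduplicated), copied from the table above.
rows = {
    "Afrikaans": ("43,482,801", "29,533,437"),
    "Basque": ("120,456,652", "45,359,710"),
    "Bavarian": ("399", "399"),
}

for language, (original, deduplicated) in rows.items():
    share = retained_fraction(original, deduplicated)
    print(f"{language}: {share:.1%} of words kept")
```

A fraction of exactly 1.0, as for Bavarian, means the original and deduplicated files contain the same number of words.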
These data are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.