When.com Web Search

Search results

  1. Results From The WOW.Com Content Network
  2. List of text corpora - Wikipedia

    en.wikipedia.org/wiki/List_of_text_corpora

    Text corpora (singular: text corpus) are large and structured sets of texts, which have been systematically collected.Text corpora are used by corpus linguists and within other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.

  3. Text corpus - Wikipedia

    en.wikipedia.org/wiki/Text_corpus

    Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in time. One of the shortest corpora in time may be the 15–30 year Amarna letters texts .

  4. Category:Corpora - Wikipedia

    en.wikipedia.org/wiki/Category:Corpora

    Main page; Contents; Current events; Random article; About Wikipedia; Contact us; Pages for logged out editors learn more

  5. Talk:List of text corpora - Wikipedia

    en.wikipedia.org/wiki/Talk:List_of_text_corpora

    This article is within the scope of WikiProject Lists, an attempt to structure and organize all list pages on Wikipedia. If you wish to help, please visit the project page, where you can join the project and/or contribute to the discussion. Lists Wikipedia:WikiProject Lists Template:WikiProject Lists List: Low

  6. Google Books Ngram Viewer - Wikipedia

    en.wikipedia.org/wiki/Google_Books_Ngram_Viewer

    [1] [2] [5] There are also some specialized English corpora, such as American English, British English, and English Fiction. [6] The program can search for a word or a phrase, including misspellings or gibberish. [5] The n-grams are matched with the text within the selected corpus, and if found in 40 or more books, are then displayed as a graph ...

  7. Ancient text corpora - Wikipedia

    en.wikipedia.org/wiki/Ancient_text_corpora

    Ancient text corpora are the entire collection of texts from the period of ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature , history , linguistics , and other fields, and are a fundamental component of the world's cultural heritage .

  8. The Pile (dataset) - Wikipedia

    en.wikipedia.org/wiki/The_Pile_(dataset)

    The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. [1] [2] It is composed of 22 smaller datasets, including 14 new ones. [1]

  9. Wu Dao - Wikipedia

    en.wikipedia.org/wiki/Wu_Dao

    WuDao Corpora (also written as WuDaoCorpora), as of version 2.0, was a large dataset constructed for training Wu Dao 2.0. It contains 3 terabytes of text scraped from web data, 90 terabytes of graphical data (incorporating 630 million text/image pairs), and 181 gigabytes of Chinese dialogue (incorporating 1.4 billion dialogue rounds). [19]