This package provides infrastructure to make text datasets available within R, even when they are too large to store within an R package or are licensed in such a way that prevents them from being included in OSS-licensed packages.
Do you want to add a new dataset to the textdata package?
- Create an R file named `prefix_*.R` in the `R/` folder, where `*` is the name of the dataset. Supported prefixes include `dataset_` and `lexicon_`.
- Inside that file create 3 functions named `download_*()`, `process_*()` and `dataset_*()`.
  - The `download_*()` function should take 1 argument named `folder_path`. It has 2 tasks. First, it should check whether the file is already downloaded; if it is, it should return `invisible()`. Second, if the file isn't at the path, it should download the file to that path.
  - The `process_*()` function should take 2 arguments, `folder_path` and `name_path`. `folder_path` denotes the path to the file downloaded by `download_*()` and `name_path` is the path where the polished data should live. The main point of `process_*()` is to turn the downloaded file into a .rds file containing a tidy tibble.
  - The `dataset_*()` function should wrap `load_dataset()`.
- Add the `process_*()` function to the named list `process_functions` in the file `process_functions.R`.
- Add the `download_*()` function to the named list `download_functions` in the file `download_functions.R`.
- Modify the `print_info` list in the `info.R` file.
- Add `dataset_*.R` to the `@include` tags in `download_functions.R`.
- Add the dataset to the table in `README.Rmd`.
- Add the dataset to `_pkgdown.yml`.
- Write a bullet in the `NEWS.md` file.
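
The download and process steps above can be sketched roughly as follows. This is a minimal illustration, not code from the package: the dataset name `example`, the file name `example.csv` and the URL are hypothetical placeholders, and only base R is used so the sketch runs without extra dependencies. In the package itself the result would be converted to a tibble (for instance with `tibble::as_tibble()`) before saving. The `dataset_*()` wrapper is left out because the arguments of `load_dataset()` are not described here.

```r
# Hypothetical dataset "example"; the URL and file name are placeholders.
example_url <- "https://example.com/example.csv"

download_example <- function(folder_path) {
  file_path <- file.path(folder_path, "example.csv")
  # Task 1: nothing to do if the raw file is already present.
  if (file.exists(file_path)) {
    return(invisible())
  }
  # Task 2: otherwise fetch the raw file into folder_path.
  utils::download.file(url = example_url, destfile = file_path)
  invisible()
}

process_example <- function(folder_path, name_path) {
  # Read the raw file that download_example() placed in folder_path.
  raw <- utils::read.csv(
    file.path(folder_path, "example.csv"),
    stringsAsFactors = FALSE
  )
  # Follow the naming guideline: `word`, not `words`.
  names(raw)[names(raw) == "words"] <- "word"
  # Store the polished data as an .rds file at name_path.
  saveRDS(raw, name_path)
  invisible(name_path)
}
```

Keeping the download and the processing in separate functions means a file that is already on disk is never fetched twice, and the processing step can be re-run on its own.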
What are the guidelines for adding datasets?
Guidelines for textdata datasets
- All datasets must have a license or terms of use clearly specified.
- Data should be a vector or tibble.
- Use `word` instead of `words` for column names.
Classification datasets
For datasets that come with both a testing and a training set, let the
user pick which one to retrieve with a `split` argument,
similar to how `dataset_ag_news()` does it.
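
One way such a `split` argument could be handled is with base R's `match.arg()`. The sketch below is an assumption about the general pattern, not a copy of `dataset_ag_news()`; the helper `example_split_file()` and the `example_*.rds` file names are hypothetical.

```r
# Hypothetical helper showing only the split-selection logic.
example_split_file <- function(split = c("train", "test")) {
  # match.arg() picks the first choice by default and rejects unknown values.
  split <- match.arg(split)
  paste0("example_", split, ".rds")
}

example_split_file()        # "example_train.rds"
example_split_file("test")  # "example_test.rds"
```

Listing the allowed values in the function signature documents them for the user, and an unsupported value such as `"validation"` fails early with an error.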
