This package provides infrastructure to make text datasets available within R, even when they are too large to store within an R package or are licensed in such a way that prevents them from being included in OSS-licensed packages.
Do you want to add a new dataset to the textdata package?
- Create an R file named `prefix_*.R` in the `R/` folder, where `*` is the name of the dataset. Supported prefixes include
  - `dataset_`
  - `lexicon_`
- Inside that file, create 3 functions named `download_*()`, `process_*()` and `dataset_*()`.
  - The `download_*()` function should take 1 argument named `folder_path`. It has 2 tasks. First, it should check whether the file is already downloaded; if it is, the function should return `invisible()`. If the file isn't at the path, the function should download the file to that path.
  - The `process_*()` function should take 2 arguments, `folder_path` and `name_path`. `folder_path` denotes the path to the file downloaded by `download_*()`, and `name_path` is the path where the polished data should live. The main point of `process_*()` is to turn the downloaded file into a .rds file containing a tidy tibble.
  - The `dataset_*()` function should wrap `load_dataset()`.
- Add the `process_*()` function to the named list `process_functions` in the file process_functions.R.
- Add the `download_*()` function to the named list `download_functions` in the file download_functions.R.
- Modify the `print_info` list in the info.R file.
- Add `dataset_*.R` to the @include tags in `download_functions.R`.
- Add the dataset to the table in `README.Rmd`.
- Add the dataset to `_pkgdown.yml`.
- Write a bullet in the `NEWS.md` file.
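The download and process steps above can be sketched roughly as follows. This is a minimal illustration, not code from the package: the dataset name `example`, the file name `example.csv`, and the URL are hypothetical placeholders, and base R readers and writers are used so the sketch runs without extra dependencies. The `dataset_*()` wrapper is left out because the arguments of `load_dataset()` are not described here.

```r
# Hypothetical contents of R/dataset_example.R.
# "example", "example.csv" and the URL are placeholders, not real package data.

download_example <- function(folder_path) {
  file_path <- file.path(folder_path, "example.csv")
  # Task 1: if the raw file is already present, do nothing.
  if (file.exists(file_path)) {
    return(invisible())
  }
  # Task 2: otherwise download the raw file to that path.
  download.file(
    url = "https://example.com/example.csv", # placeholder URL (assumption)
    destfile = file_path
  )
}

process_example <- function(folder_path, name_path) {
  file_path <- file.path(folder_path, "example.csv")
  data <- read.csv(file_path, stringsAsFactors = FALSE)
  # Store a tibble when the tibble package is installed;
  # otherwise fall back to a plain data frame.
  if (requireNamespace("tibble", quietly = TRUE)) {
    data <- tibble::as_tibble(data)
  }
  saveRDS(data, name_path)
}

# In the package, these entries would be added to the existing named lists
# in download_functions.R and process_functions.R; new lists are created
# here only to keep the sketch self-contained.
download_functions <- list(example = download_example)
process_functions <- list(example = process_example)

# Local demonstration that avoids the network: the raw file is written by
# hand, so download_example() takes its early-return branch.
folder_path <- tempfile("textdata-example-")
dir.create(folder_path)
write.csv(
  data.frame(word = c("good", "bad"), sentiment = c("positive", "negative")),
  file.path(folder_path, "example.csv"),
  row.names = FALSE
)
name_path <- file.path(folder_path, "example.rds")

download_functions$example(folder_path)
process_functions$example(folder_path, name_path)
result <- readRDS(name_path)
```

Keeping the existence check at the top of `download_*()` means repeated calls do not fetch the file again, and `process_*()` only has to know where the raw file is and where the .rds file should go.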
What are the guidelines for adding datasets?
Guidelines for textdata datasets
- All datasets must have a license or terms of use clearly specified.
- Data should be a vector or tibble.
- Use `word` instead of `words` for column names.
Classification datasets
For datasets that come with separate training and testing sets, let the user pick which one to retrieve with a `split` argument, similar to how `dataset_ag_news()` does.
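One common way to offer such a choice in R is `match.arg()`. The sketch below only illustrates that pattern and is not necessarily how `dataset_ag_news()` is implemented; the helper `example_file_name()` and the file-name scheme are hypothetical.

```r
# Hypothetical helper: maps the user's `split` choice to a processed file name.
example_file_name <- function(split = c("train", "test")) {
  # match.arg() validates the choice and falls back to the first option
  # ("train") when the argument is not supplied.
  split <- match.arg(split)
  paste0("example_", split, ".rds")
}

example_file_name()
example_file_name("test")
```

Listing the allowed values in the function signature documents them for the user, and any other value raises an error before a file is read.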