The TREC dataset is a dataset for question classification consisting of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. Models are evaluated based on accuracy.
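Since models are evaluated on accuracy, the metric can be sketched in a few lines of base R. The label vectors below are hypothetical toy values, not taken from the dataset, and the names truth and predicted are placeholders for illustration.

```r
# Toy gold labels and model predictions (hypothetical values, not real TREC data)
truth <- c("LOC", "HUM", "NUM", "DESC", "ENTY")
predicted <- c("LOC", "HUM", "DESC", "DESC", "ENTY")

# Accuracy is the share of predictions that match the gold label
accuracy <- mean(predicted == truth)
accuracy
```

In practice, truth would presumably be the class column of the "test" split and predicted the output of a classifier.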
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- split
Character. Return training ("train") data or testing ("test") data. Defaults to "train".
- version
Character. Version 6 ("6") or version 50 ("50"). Defaults to "6".
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Value
A tibble with 5,452 or 500 rows for "train" and "test" respectively and 2 variables:
- class
Character, denoting the class
- text
Character, question text
Details
The classes in TREC-6 are
ABBR - Abbreviation
DESC - Description and abstract concepts
ENTY - Entities
HUM - Human beings
LOC - Locations
NUM - Numeric values
The classes in TREC-50 are listed at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.
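The TREC-6 codes can be turned into readable labels with a named lookup vector. The following is a minimal sketch that assumes the class column holds the short codes listed above, with the numeric class coded as NUM. The data frame toy is a hypothetical stand-in for the returned tibble, and its questions are made up for illustration.

```r
# Lookup from TREC-6 class code to its description
trec6_labels <- c(
  ABBR = "Abbreviation",
  DESC = "Description and abstract concepts",
  ENTY = "Entities",
  HUM  = "Human beings",
  LOC  = "Locations",
  NUM  = "Numeric values"
)

# Hypothetical stand-in for the tibble returned by dataset_trec()
toy <- data.frame(
  class = c("LOC", "HUM", "NUM", "LOC"),
  text = c(
    "Where is the Eiffel Tower ?",
    "Who wrote Hamlet ?",
    "How many days are in a leap year ?",
    "What country is Kyoto in ?"
  ),
  stringsAsFactors = FALSE
)

# Add a readable label by indexing the lookup with the class codes
toy$class_label <- unname(trec6_labels[toy$class])

# Count questions per class
class_counts <- table(toy$class)
class_counts
```

The same indexing should work on the real data, provided its class values match the names in trec6_labels; a code missing from the lookup would yield NA.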
See also
Other topic:
dataset_ag_news(), dataset_dbpedia()
Examples
if (FALSE) {
dataset_trec()
# Custom directory
dataset_trec(dir = "data/")
# Deleting dataset
dataset_trec(delete = TRUE)
# Returning filepath of data
dataset_trec(return_path = TRUE)
# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")
train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")
}