HuCB Hungarian CommitmentBank Corpus
The HuCommitmentBank consists of short text fragments in which at least one sentence contains a subordinate clause that is syntactically embedded under a logical inference-cancelling operator. In the database, the premise is the complete text fragment and the hypothesis is the embedded clause. The inference task is to decide to what extent the author of the text is committed to the truth of the subordinate clause. The corpus consists of a training, a validation and a test set (250, 103 and 250 examples, respectively).
HuCOLA Hungarian Corpus of Linguistic Acceptability
The corpus contains 9 076 Hungarian sentences labelled for their acceptability/grammaticality (0/1).
The sentences were collected by two human annotators from three linguistics books. Each sentence was annotated by four human annotators. The final label of a sentence is the one assigned by the majority of the annotators.
The train, validation and test sets make up roughly 80% (7 276 sentences), 10% (900 sentences) and 10% (900 sentences) of the corpus, respectively.
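The majority labelling described above can be sketched as follows. This is only an illustration: the function name `majority_label` is hypothetical, and since the description does not say how a 2-2 split among the four annotators was resolved, the sketch simply returns `None` in that case.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None if the top two labels tie.

    `labels` is the list of 0/1 judgements given by the annotators.
    How ties were handled in the corpus is not stated, so a tie is
    reported as None here (an assumption of this sketch).
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Four annotators, three of whom judge the sentence acceptable.
print(majority_label([1, 1, 1, 0]))
```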
HuCoPa Hungarian Choice of Plausible Alternatives Corpus
The dataset contains 1 000 instances. Each instance is composed of a premise and two alternatives. The task is to select the alternative that describes a situation standing in a causal relation to the situation described by the premise. The corpus was created by translating and re-annotating the original English COPA corpus. The train, validation and test sets contain 400, 100 and 500 instances, respectively.
HuRTE Hungarian Recognizing Textual Entailment dataset
The dataset contains 4 504 instances. Each example contains a (sometimes multi-sentence) premise and a one-sentence hypothesis, and the task is to decide whether the former entails the latter or not. The corpus was created by translating and re-annotating the instances of the RTE datasets that are part of the GLUE benchmark.
The train, validation and test sets contain 2 131, 242 and 2 131 instances, respectively.
HuSST Hungarian version of the Stanford Sentiment Treebank
The dataset contains 11 683 sentences. Each sentence is annotated for its sentiment on a three-point scale. The corpus was created by translating and re-annotating the full sentences of the SST.
The train, validation and test sets contain 9 347, 1 168 and 1 168 sentences, respectively.
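As a quick consistency check, the split sizes of the corpora for which a total is stated can be summed and compared with that total. The dictionary layout below is just an assumption of this sketch. The numbers are copied from the descriptions above.

```python
# corpus name -> (stated total, (train, validation, test))
SPLITS = {
    "HuCOLA": (9076, (7276, 900, 900)),
    "HuCoPa": (1000, (400, 100, 500)),
    "HuRTE": (4504, (2131, 242, 2131)),
    "HuSST": (11683, (9347, 1168, 1168)),
}


def split_shares(total, sizes):
    """Return each split's share of the corpus as a percentage (one decimal)."""
    return tuple(round(100 * size / total, 1) for size in sizes)


for name, (total, sizes) in SPLITS.items():
    # Every stated total should equal the sum of its three splits.
    assert sum(sizes) == total, name
    print(name, split_shares(total, sizes))
```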
HuWNLI Anaphora resolution datasets for Hungarian as an inference task
This is a Hungarian anaphora resolution dataset, designed as a sentence pair classification task of natural language inference. Its base, the HuWS corpus, was created by translating and manually curating the original English Winograd schemata. The NLI format was created by replacing the ambiguous pronoun in each schema with each of its possible referents. We extended the set of sentence pairs derived from the schemata with translations of the sentence pairs that make up the WNLI dataset of GLUE. The data is distributed in three splits: training set (562), development set (59) and test set (134).
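The conversion from a schema to NLI sentence pairs can be illustrated roughly as below. This is a simplified sketch, not the procedure actually used to build the corpus: the function name, the field names and the English example are assumptions made for illustration, and the real hypotheses may be formed differently (for instance from only the clause that contains the pronoun).

```python
def schema_to_nli_pairs(sentence, pronoun_span, candidates, correct):
    """Build one premise/hypothesis pair per candidate referent.

    `pronoun_span` is the (start, end) character span of the ambiguous
    pronoun in `sentence`. The hypothesis is the sentence with the pronoun
    replaced by a candidate; the label is 1 only for the correct referent.
    """
    start, end = pronoun_span
    pairs = []
    for candidate in candidates:
        hypothesis = sentence[:start] + candidate + sentence[end:]
        pairs.append(
            {
                "premise": sentence,
                "hypothesis": hypothesis,
                "label": int(candidate == correct),
            }
        )
    return pairs


# Illustrative English schema (not taken from the corpus).
sentence = "The trophy does not fit in the suitcase because it is too big."
start = sentence.index(" it ") + 1  # position of the ambiguous pronoun
pairs = schema_to_nli_pairs(
    sentence,
    (start, start + len("it")),
    candidates=["the trophy", "the suitcase"],
    correct="the trophy",
)
for pair in pairs:
    print(pair["label"], pair["hypothesis"])
```

Each schema thus yields as many sentence pairs as it has candidate referents, exactly one of which is labelled as entailed.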