Parameterized Synthetic Text Generation with SimpleStories
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier for the fewest-parameter language model that outputs grammatical natural language.
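The abstract's core idea, parameterizing prompts at multiple levels of abstraction, can be illustrated with a minimal sketch. The parameter pools and template below are hypothetical placeholders, not the actual SimpleStories categories or prompt wording; they only show how drawing one value per axis yields diverse, controllable story prompts at scale.

```python
import random

# Hypothetical parameter pools at different levels of abstraction
# (semantic theme, concrete character, structural feature). The real
# SimpleStories categories are not reproduced here.
THEMES = ["friendship", "courage", "curiosity"]
CHARACTERS = ["a small fox", "a young girl", "a toy robot"]
FEATURES = ["dialogue", "a moral at the end", "a plot twist"]

# Illustrative prompt template, not the paper's actual prompt.
TEMPLATE = (
    "Write a short story in simple language about {theme}, "
    "featuring {character}. The story should include {feature}."
)

def sample_prompt(rng: random.Random) -> str:
    """Fill the template with one randomly drawn value per parameter axis."""
    return TEMPLATE.format(
        theme=rng.choice(THEMES),
        character=rng.choice(CHARACTERS),
        feature=rng.choice(FEATURES),
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_prompt(rng))
```

Each independently sampled axis multiplies the number of distinct prompts, which is how a small set of hand-written parameter values can induce diversity across millions of generated samples.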
Thomas Dooms, Mat Allen, Emerald Zhang, Juan Diego Rodriguez, Noa Nabeshima, Thomas Marshall, Lennart Finke, Chandan Sreedhara, Dan Braun
Thomas Dooms, Mat Allen, Emerald Zhang, Juan Diego Rodriguez, Noa Nabeshima, Thomas Marshall, Lennart Finke, Chandan Sreedhara, Dan Braun. Parameterized Synthetic Text Generation with SimpleStories. arXiv preprint arXiv:2504.09184 (2025-04-12). https://arxiv.org/abs/2504.09184