
Parameterized Synthetic Text Generation with SimpleStories

Source: arXiv
Abstract

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier regarding the fewest-parameter language model that outputs grammatical natural language.
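To illustrate the idea of parameterizing prompts at multiple levels of abstraction, the minimal sketch below composes a story-generation prompt from independently sampled parameters. The parameter categories and values here are hypothetical stand-ins, not the ones used in the paper; sampling each axis separately is one way such a pipeline can induce syntactic and semantic diversity at scale.

```python
import random

# Hypothetical parameter pools; the actual categories and values used for
# SimpleStories are defined in the paper. These are illustrative stand-ins.
THEMES = ["friendship", "honesty", "curiosity"]
TOPICS = ["a lost kitten", "a rainy day", "a school play"]
STYLES = ["simple dialogue", "third-person narration"]
GRAMMAR_FEATURES = ["past tense only", "short sentences", "a question at the end"]


def build_prompt(rng: random.Random) -> str:
    """Compose one story prompt from independently sampled parameters."""
    theme = rng.choice(THEMES)
    topic = rng.choice(TOPICS)
    style = rng.choice(STYLES)
    feature = rng.choice(GRAMMAR_FEATURES)
    return (
        f"Write a short story in simple language about {topic}. "
        f"The story should illustrate the theme of {theme}, "
        f"use {style}, and include {feature}."
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(build_prompt(rng))
        print("---")
```

Because each axis is sampled independently, the number of distinct prompt configurations grows multiplicatively with the number of parameter categories, which is what makes this kind of control feasible at the scale of millions of samples.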

Thomas Dooms, Mat Allen, Emerald Zhang, Juan Diego Rodriguez, Noa Nabeshima, Thomas Marshall, Lennart Finke, Chandan Sreedhara, Dan Braun

Subject: Linguistics / Commonly Used Foreign Languages

Thomas Dooms, Mat Allen, Emerald Zhang, Juan Diego Rodriguez, Noa Nabeshima, Thomas Marshall, Lennart Finke, Chandan Sreedhara, Dan Braun. Parameterized Synthetic Text Generation with SimpleStories [EB/OL]. (2025-04-12) [2025-06-22]. https://arxiv.org/abs/2504.09184
