FABLE: Fiction Adapted BERT for Literary Entities

Published: September 10, 2025

Maksim Gorkii

Announcing FABLE: Fiction Adapted BERT for Literary Entities

Today Salad releases our first open model, FABLE, a named-entity recognition model based on DeBERTa v3 specializing in narrative fiction. We also release the datasets it was trained on, Fiction-1B and Fiction-NER-750M, spanning more than 200 years of fiction and literary style including classics as well as contemporary fanfiction. The model as well as the datasets are released under the MIT license. Here we’ll go into why we did this, the unique challenges of the domain, how we synthesized the dataset on Salad’s Community GPU Cloud, and how we did our training on Salad’s Secure Datacenter Cloud.

The Task: Named Entity Recognition in Fiction

In machine learning, Named Entity Recognition, or NER, is the task of determining which words in a span of text are named entities. For example, in the sentence “King Arthur sat in his castle at Camelot, discussing plans with Merlin. His sword Excalibur leaned against the stone wall nearby,” our named entities are as follows:

King Arthur – Character
Camelot – Location
Merlin – Character
Excalibur – Object

A quick scan of Hugging Face will tell you there are many NER models already available, so you may at first dismiss this as a solved problem. However, existing NER models tend to share the same few problems:

They have not been trained on fiction. They are trained on things like Wikipedia, financial reports, and news, which have very different syntax patterns than narrative fiction. This limits the ability of these models to successfully generalize to narrative fiction, and they tend to perform quite poorly at the task.
Existing NER datasets are built from Wikipedia, financial reports, and news, leading to the types of models described above.
Existing NER datasets do not track the types of entities found in narrative fiction. For instance, a named object like “Excalibur” is not going to be picked up by a model trained to find financial institutions and geopolitical entities. Even common NER tags such as “Person” inadequately capture what a “Character” is. For example, R2-D2 is a character in Star Wars, but arguably not a person.

However, despite these shortcomings, we know language models such as DeBERTa v3 perform extremely well at NER in general, so we can feel fairly confident that we could train an extremely effective NER model if the dataset we needed existed.

The Strategy

To get the best outcome, we actually want to create two datasets:

A large corpus of clean narrative fiction text. We gathered around 1 billion words of publicly available fiction, spread across more than 20,000 documents from multiple sources.
A smaller, but still large, corpus of clean, diverse narrative fiction text, tagged with entity labels.

The first dataset gives us a large and diverse corpus of fiction upon which to build our NER dataset.

The second dataset will be created by annotating the first dataset with a small language model, and sampling the highest quality examples.

Fiction 1B – A Billion Words

I sourced more than 20,000 English-language documents from Project Gutenberg, AO3, and the Internet Archive, representing a wide range of styles from classic Victorian literature all the way through contemporary fanfiction. We collected the initial raw dataset using Python scripts to download the content gradually, in accordance with request rate limits set by the sources.

Project Gutenberg hosts a catalog CSV that includes metadata such as title, author, and subjects. We filtered based on the presence of fiction-related keywords in the Subjects column.

fiction_keywords = [
    'fiction', 'novel', 'stories', 'tale', 'adventure',
    'mystery', 'romance', 'fantasy', 'horror', 'detective',
    'science fiction', 'historical fiction', 'western',
    'thriller', 'suspense'
]
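As a rough illustration of that filter, here is a minimal sketch that applies the keyword list to a catalog CSV. The `Subjects` column comes from the description above; the other column names, the sample rows, and the helper names `is_fiction` and `filter_catalog` are made up for the example, and this is not necessarily how our download scripts were written.

```python
import csv
import io

FICTION_KEYWORDS = [
    'fiction', 'novel', 'stories', 'tale', 'adventure',
    'mystery', 'romance', 'fantasy', 'horror', 'detective',
    'science fiction', 'historical fiction', 'western',
    'thriller', 'suspense'
]


def is_fiction(subjects, keywords=FICTION_KEYWORDS):
    """Return True if any keyword occurs in the Subjects field (case-insensitive)."""
    lowered = subjects.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_catalog(csv_text):
    """Parse catalog CSV text and keep rows whose Subjects look like fiction."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_fiction(row.get("Subjects") or "")]


# Tiny invented catalog, for illustration only.
SAMPLE = """Text#,Title,Subjects
1,Example Novel,"England -- Fiction; Love stories"
2,Example Cookbook,"Cooking, American"
3,Example Mystery,"Detective and mystery stories"
"""

fiction_rows = filter_catalog(SAMPLE)
print([row["Title"] for row in fiction_rows])
```

Note that this is plain substring matching, so a keyword like 'novel' would also match a subject such as "novelty". That is acceptable for a coarse first pass, since later filtering steps catch non-prose content.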

For AO3, we used the ao3-api Python package to gradually paginate through the archive, filtering to English-language works with at least 15,000 words but fewer than 500,000, sorted by “Kudos”, a measure of user favor.

For Internet Archive, we used their search endpoint, and a significant amount of keyword filtering. Ultimately we did not get much content from this source due to licensing restrictions.

From a somewhat larger initial set of documents, we used a Genre Classifier model based on RoBERTa to filter out content which was not prose, and used simple text matching to filter out content that was copyrighted or otherwise licensed in a way that did not permit our use. We also discarded documents under 1000 words. This process removed about 30 million words.

The final selection of 20,000 documents was uploaded to Cloudflare R2 for storage and distribution, and then processed by our LLM worker that I’ll detail later. As part of that process, each document underwent additional cleaning, using the genre classifier model again to discard paragraphs within a document that were not the right type, such as legal text and metadata. This process removed an additional 40 million words, leaving us with a total of 1.02B words that we can have relatively high confidence are of the correct type.

The final split of words by source was:

Project Gutenberg: 76.37%
AO3: 22.20%
Internet Archive: 1.43%

Fiction-NER-750M – 750 Million Entity Labels

We’ll use a large language model, Qwen3 4B Thinking, to identify entities in blocks of text, and then use heuristics, pattern-matching, and regular expressions to find those entities within the text and label tokens with entity categories. We tried several different approaches and several different models before arriving at this solution as the most performant, and the most accurate in sampled documents.
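To make the matching step concrete, here is a simplified sketch of turning a list of entity names (as an LLM might return them) into per-token labels. It assumes a BIO tagging scheme with "O" for non-entity tokens and a naive regex tokenizer. The function name `tag_tokens` and the tokenization rule are illustrative assumptions, and the real pipeline uses more heuristics than this.

```python
import re

# Words (allowing internal hyphens/apostrophes, e.g. "R2-D2") or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tag_tokens(text, entities):
    """Label each token in `text` with a BIO tag.

    `entities` maps an entity's surface form to its category, e.g. {"Merlin": "CHA"}.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]
    labels = ["O"] * len(tokens)
    # Match longer names first so "King Arthur" takes priority over "Arthur".
    for name in sorted(entities, key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        for match in pattern.finditer(text):
            covered = [
                i for i, (_, start, end) in enumerate(tokens)
                if start >= match.start() and end <= match.end()
            ]
            # Skip matches that overlap tokens already claimed by a longer entity.
            if not covered or any(labels[i] != "O" for i in covered):
                continue
            for position, i in enumerate(covered):
                prefix = "B-" if position == 0 else "I-"
                labels[i] = prefix + entities[name]
    return [(token, label) for (token, _, _), label in zip(tokens, labels)]


sentence = "King Arthur sat in his castle at Camelot, discussing plans with Merlin."
found = {"King Arthur": "CHA", "Camelot": "LOC", "Merlin": "CHA"}
tagged = tag_tokens(sentence, found)
for token, label in tagged:
    print(f"{token}\t{label}")
```

Matching on the original text by character offset, rather than asking the LLM to emit token-level tags directly, keeps the labels aligned with the source text even when the model paraphrases or reorders its answer.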

We tried a variety of similarly-sized open models, including the base Qwen3 4B model, Llama 3.2 3B, and Phi 4 Mini Reasoning, and found that the “thinking” variant of Qwen3 4B was much more accurate at this task than the other models, though it was fairly slow due to thinking tokens, and would occasionally consume its entire token budget on thinking without generating a result. In these situations, we retried with a slightly modified prompt designed to discourage overthinking.

We evaluated accuracy and tokens-tagged-per-second with blocks of text ranging from 5,000 tokens to 30,000 tokens, and found 25,000 to be a “sweet spot” for both throughput and accuracy, though individual inferences could be quite time-consuming.
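One simple way to build blocks of roughly that size is to pack whole paragraphs greedily until a token budget is reached. The sketch below assumes that approach and approximates token counts by whitespace-separated words. The function name `chunk_paragraphs` and the default counter are illustrative, and in practice you would pass in the model tokenizer's own counting function.

```python
def chunk_paragraphs(paragraphs, max_tokens=25_000, count_tokens=None):
    """Greedily pack whole paragraphs into blocks of at most `max_tokens` tokens.

    A single paragraph longer than the budget becomes its own oversized block.
    """
    if count_tokens is None:
        # Rough stand-in for a real tokenizer: count whitespace-separated words.
        count_tokens = lambda paragraph: len(paragraph.split())

    blocks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        # Start a new block if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            blocks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n
    if current:
        blocks.append("\n\n".join(current))
    return blocks


demo_blocks = chunk_paragraphs(["a b c", "d e", "f g h i"], max_tokens=5)
print(demo_blocks)
```

Splitting only at paragraph boundaries avoids cutting an entity mention in half between two blocks.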

I was able to prototype all of this on my laptop RTX 3080 Ti 16GB, but you can also spin up our Ubuntu Recipe with any GPU available on SaladCloud, defaulting to RTX 3090 24GB.

After quite a bit of fiddling with the prompt, tweaking pattern-matching rules, and many wrongly-tagged tokens, we had something that was working well enough, enough of the time, based on my own judgement reading random samples. We packaged the cleaning and extraction pipeline into a Docker container and deployed it to a Container Group on Salad Community Cloud configured to use RTX 3090 and RTX 4090 GPUs. We used Salad Kelpie to queue extraction jobs and handle autoscaling the container group, so it automatically turned itself off when it finished the batch. A peak of 250 machines worked jobs concurrently, completing the task over a weekend.

As expected from an LLM-based solution, not all documents parsed equally well from the prompts we used. We removed the bottom 10% of documents by entity density, and the top 10%, leaving us with 811M labeled words, divided as follows across our 3 sources:

Project Gutenberg: 75.61%
AO3: 23.41%
Internet Archive: 0.98%

Additional filtering removed sections of text within each document that were outliers in terms of entity density. Some sections had 0% entities, while others had 100%, indicating a problem during the LLM process for those sections.
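The density filtering can be sketched as follows. This assumes each document is a dict with a `labels` list that uses "O" for non-entity tokens. The function names and the data layout are illustrative rather than taken from our pipeline, and the section-level check shown here only covers the two degenerate cases mentioned above.

```python
def entity_density(labels):
    """Fraction of tokens carrying an entity label (anything other than "O")."""
    if not labels:
        return 0.0
    return sum(label != "O" for label in labels) / len(labels)


def trim_by_density(docs, lower_frac=0.10, upper_frac=0.10):
    """Drop the lowest and highest fractions of documents ranked by entity density."""
    ranked = sorted(docs, key=lambda doc: entity_density(doc["labels"]))
    n = len(ranked)
    low = int(n * lower_frac)
    high = n - int(n * upper_frac)
    return ranked[low:high]


def has_plausible_density(labels):
    """Reject sections where nothing, or everything, was tagged as an entity."""
    density = entity_density(labels)
    return 0.0 < density < 1.0


# Ten toy documents with densities 0.0, 0.1, ..., 0.9.
docs = [{"id": i, "labels": ["B-CHA"] * i + ["O"] * (10 - i)} for i in range(10)]
kept = trim_by_density(docs)
print([doc["id"] for doc in kept])
```

Ranking by density and trimming both tails is a cheap proxy for annotation quality: it requires no second model and no manual review, at the cost of discarding some documents that were labeled correctly.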

From here, we created pre-tokenized and aligned training examples averaging about 40 tokens in length, using the DeBERTa V2 tokenizer (shared by the v3 model). The dataset includes the original text of each example as well, in case you want to use a different tokenizer. This process yielded about 24 million examples, all of which are included, even though it’s a bit overkill for NER training.
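If you do re-tokenize, word-level labels need to be re-aligned to the new subword tokens. A common convention, sketched below, uses the `word_ids()` mapping that Hugging Face fast tokenizers return: special tokens get the ignore index -100, and continuation pieces of a word that starts an entity become inside tags. The label list, the "O" non-entity tag, and the hard-coded `word_ids` in the example are assumptions for illustration, not the exact layout of our dataset.

```python
LABELS = ["O", "B-CHA", "I-CHA", "B-LOC", "I-LOC", "B-OBJ", "I-OBJ", "B-MISC", "I-MISC"]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # conventionally skipped by the loss function


def align_labels(word_labels, word_ids):
    """Map word-level BIO labels onto subword tokens.

    `word_ids` has one entry per subword token: the index of the word it came
    from, or None for special tokens such as [CLS] and [SEP].
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        else:
            label = word_labels[word_id]
            # Later pieces of the same word continue the entity instead of restarting it.
            if word_id == previous and label.startswith("B-"):
                label = "I-" + label[2:]
            aligned.append(LABEL2ID[label])
        previous = word_id
    return aligned


# "King Arthur drew Excalibur", with a hypothetical split of "Excalibur" into three pieces.
word_labels = ["B-CHA", "I-CHA", "O", "B-OBJ"]
word_ids = [None, 0, 1, 2, 3, 3, 3, None]
aligned = align_labels(word_labels, word_ids)
print(aligned)
```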

Training FABLE

Fiction text has a much broader range of styles and conventions than news text, presenting meaningful challenges for any model trying to comprehend it. Additionally, the dataset has extremely uneven classes, with non-entity tokens representing 96% of the whole. This means a model could achieve 96% accuracy (but 0% usefulness) by just guessing “Not an Entity” for every single token. For this reason, NER training usually uses a metric called F1, the harmonic mean of recall and precision. Recall measures how many of the entities in the example were identified, while precision measures how many of the model’s identifications were correct. An example of something that shows high recall, but poor precision:

Token	Text	Label
1	King	B-CHA
2	Arthur	I-CHA
3	sat	B-MISC
4	in	B-MISC
5	his	B-MISC
6	castle	B-MISC
7	at	B-MISC
8	Camelot	B-LOC
9	,	B-MISC
10	discussing	B-MISC
11	plans	B-MISC
12	with	B-MISC
13	Merlin	B-CHA
14	.	B-MISC
15	His	B-MISC
16	sword	B-MISC
17	Excalibur	B-OBJ
18	leaned	B-MISC
19	against	B-MISC
20	the	B-MISC
21	stone	B-MISC
22	wall	B-MISC
23	nearby	B-MISC

Here, the model has correctly labeled all of the entities that are present. It gets 100% recall, because every single entity present in the text was identified. However, it also labeled every other token as an entity, so only 5 of its 23 entity predictions are correct, a very low precision of 21.7%. By targeting F1, we teach the model to balance these factors.
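The arithmetic for this example can be checked with a few lines of Python. This sketch counts at the token level, which is what produces the 21.7% figure, and assumes "O" marks non-entity tokens in the gold labels. NER is often scored at the entity-span level instead, which would give somewhat different numbers.

```python
def token_prf(gold, pred):
    """Token-level precision, recall, and F1 over entity labels ("O" = non-entity)."""
    true_positives = sum(g == p and g != "O" for g, p in zip(gold, pred))
    predicted = sum(p != "O" for p in pred)  # tokens the model called entities
    actual = sum(g != "O" for g in gold)     # tokens that really are entities
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold labels for the 23 tokens in the table above.
gold = (["B-CHA", "I-CHA"] + ["O"] * 5 + ["B-LOC"] + ["O"] * 4
        + ["B-CHA"] + ["O"] * 3 + ["B-OBJ"] + ["O"] * 6)
# The over-eager model tags every non-entity token as B-MISC.
pred = [label if label != "O" else "B-MISC" for label in gold]

precision, recall, f1 = token_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Despite perfect recall, the F1 score for this prediction comes out to about 0.357, because the harmonic mean is dragged down by the weaker of the two components.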

Even within entity tokens, the classes are highly uneven, with far more character tokens than misc, for example. We don’t want the model to be able to score very highly by always predicting character, so we used Focal Loss with token classes weighted exponentially by inverse frequency. This means the model gets almost no reward for predicting a non-entity token correctly, but is rewarded heavily for predicting the rarer classes successfully.
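To show the shape of this loss, here is a dependency-free sketch of focal loss for a single token, with class weights derived from inverse frequency raised to a power. The exact weighting formula and focusing parameter are not spelled out above, so `power`, `gamma`, the normalisation, and the class counts are all assumptions chosen for illustration. A real training loop would compute this over batches of logits in a tensor library.

```python
import math


def class_weights(counts, power=0.5):
    """Weight each class by (1 / relative frequency) ** power, normalised to mean 1."""
    total = sum(counts.values())
    raw = {name: (total / count) ** power for name, count in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: weight / mean for name, weight in raw.items()}


def focal_loss(probs, target, weights, gamma=2.0):
    """Focal loss for one token: -w_t * (1 - p_t) ** gamma * log(p_t)."""
    p_t = probs[target]
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Invented token counts with roughly 96% non-entity tokens.
counts = {"O": 960, "CHA": 25, "LOC": 8, "OBJ": 4, "MISC": 3}
weights = class_weights(counts)

# A confident, correct non-entity prediction contributes almost nothing...
easy = focal_loss({"O": 0.95}, "O", weights)
# ...while an uncertain prediction on a rare class contributes far more.
hard = focal_loss({"MISC": 0.30}, "MISC", weights)
print(f"easy={easy:.6f} hard={hard:.6f}")
```

The two mechanisms are complementary: the class weights counter the imbalance between labels, while the `(1 - p_t) ** gamma` factor shrinks the loss on examples the model already classifies confidently, whatever their class.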

We used hidden dropout and attention dropout to improve robustness at entity-boundary detection, and prevent over-reliance on specific positional patterns. A cosine learning rate scheduler with a brief warmup period helped the model learn the basics of the classification task quickly while performing gradual refinements toward the end of training.
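For readers unfamiliar with that schedule, the sketch below computes a learning rate that ramps up linearly during warmup and then decays along a half cosine to zero. The step counts and peak rate in the example are placeholders, not our training settings. Hugging Face Transformers provides an equivalent ready-made scheduler, `get_cosine_schedule_with_warmup`.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
TOTAL, WARMUP, PEAK = 1000, 100, 2e-5
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, TOTAL, WARMUP, PEAK))
```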

Training was performed on 8xA100 40GB GPUs via Salad’s Secure Cloud using Hugging Face Transformers for the training primitives, Datasets for data management, and Accelerate for scheduling training across multiple GPUs. After some experimentation we arrived at a per-device batch size of 256 examples, which uses most of the available VRAM without ever going over. No fancy job scheduling or additional orchestration, just a Jupyter notebook (included in the model repo). We used Weights & Biases to monitor and analyze training performance.

We iterated on basic training hyperparameters using 250k samples, which trained very quickly. Then, we gradually scaled up to 12M training examples, and a validation set of 1.2M. Experimentation showed little to no improvements to be had after a single epoch, and sharply diminishing returns after 4M examples.

Why Do It At All?

In the age of do-it-all large language models, why would we use an older BERT-based architecture? Speed, Cost, and Accuracy. While LLMs are mostly fairly adept at the NER task, as evidenced by us using one to generate our training labels, we also had to throw away fully 25% of the results due to exceptionally poor accuracy. By the end of training, our model consistently found entities missed by the original pipeline, outperforming the training dataset. Additionally, even with our optimized process that used large batches of text and efficient text-matching, we averaged about 50 labels per second per GPU with our Qwen3-based pipeline. By comparison, our DeBERTa-based model can label many thousands of tokens per second, and even has good CPU-only performance. This makes it more suitable for real-time applications, offline processing, and edge devices including phones and tablets.

There are many problems, like NER, that are better solved by traditional ML techniques than by generative AI. However, Gen AI and LLMs can still play a critical role in creating the datasets we need for training, especially with abundant low-cost compute from SaladCloud.

Where To Go Next

The results from this were pretty good, but we could probably get even better performance with two additional procedures:

We could domain-adapt the base DeBERTa model on our Fiction 1B dataset, using masked language modeling, and then train the NER task on top of that. The additional pre-training would teach the model more about the syntax and patterns of narrative fiction, theoretically improving its downstream performance on other tasks related to narrative fiction.
We could re-annotate the NER dataset using our current FABLE, taking only high-confidence predictions, and then re-run training with the newer dataset. The re-labeling could be done quickly on a laptop, since FABLE is very fast and lightweight.
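The second idea hinges on deciding which predictions count as high-confidence. One simple rule, sketched here as an assumption rather than a tested recipe, is to keep an example only if the model's least-confident token prediction clears a threshold. The `confidences` field, the function name, and the 0.9 threshold are all hypothetical.

```python
def keep_confident(examples, threshold=0.9):
    """Keep examples whose least-confident token prediction meets the threshold."""
    return [
        example for example in examples
        if example["confidences"] and min(example["confidences"]) >= threshold
    ]


# Invented predictions: per-token probability of the predicted label.
predictions = [
    {"text": "Merlin smiled.", "confidences": [0.99, 0.97, 0.98]},
    {"text": "The Grey Ones came.", "confidences": [0.95, 0.55, 0.60, 0.96, 0.99]},
]
confident = keep_confident(predictions)
print([example["text"] for example in confident])
```

Using the minimum rather than the mean means a single doubtful token disqualifies the whole example, which is a conservative choice for building a cleaner training set.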

Between these two things, we could probably squeeze out a few more hundredths of F1. Maybe you’ll do them, and release an even better version.
