[Updated on 2021-02-01: Updated to version 2.0 with several new sections added and many typos fixed.]
[Updated on 2021-05-26: Added P-tuning and Prompt Tuning in the “prompt design” section.]
[Updated on 2021-09-19: Added the section on “unlikelihood training”.]
There is a gigantic amount of free text on the Web, several orders of magnitude more than labelled benchmark datasets. State-of-the-art language models (LM) are trained on such unsupervised Web data at large scale. When generating samples from a LM by iteratively sampling the next token, we do not have much control over attributes of the output text, such as the topic, the style or the sentiment. Many applications demand good control over the model output. For example, if we plan to use a LM to generate reading materials for kids, we would like to guide the output stories to be safe, educational and easily understood by children.
How do we steer a powerful unconditioned language model? In this post, we will delve into several approaches for controlled content generation with an unconditioned language model. Note that model steerability is still an open research question. Each of the methods introduced below has its own pros and cons.
- Apply guided decoding strategies and select desired outputs at test time.
- Optimize for the most desired outcomes via good prompt design.
- Fine-tune the base model or steerable layers to do conditioned content generation.
In the following discussion, we assume we have access to a pretrained generative language model $p_\theta$.
Decoding Strategies
By adopting different decoding methods, we can place restrictions or preferences on the sampling process to alter the generated samples without modifying any model weights. Even though decoding strategies do not change the values of any trainable parameter, they are an important component of controllable generation.
Common Decoding Methods
Since the final layer of the model predicts logits $o$ over the vocabulary space, the next token can be sampled by applying softmax with temperature $T$. The probability of sampling the $i$-th token is

$$ p_i \propto \frac{\exp(o_i / T)}{\sum_j \exp(o_j / T)} $$

A low temperature makes the distribution sharper and a high value makes it softer.
Greedy search: Always pick the next token with the highest probability, equivalent to setting temperature $T \to 0$.
Beam search: It essentially does breadth-first search, one token per tree level, but with a limited bandwidth. At each level of the search tree, beam search keeps track of the $n$ (the “beam width”) most likely sequences and expands all the successors of these candidates at the next level.
However, maximization-based decoding does not guarantee high-quality generation.

Top-k sampling (Fan et al., 2018): At each sampling step, only the top $k$ most likely tokens are kept and the probability mass is redistributed among them before sampling.
Nucleus sampling (Holtzman et al. 2019): Also known as “top-p sampling”. One drawback of top-k sampling is that the predefined number $k$ does not take into account how skewed the probability distribution might be. Nucleus sampling instead keeps the smallest set of top candidates with cumulative probability exceeding a threshold $p$ (e.g. 0.95) and then rescales the distribution among them.
Both top-k and nucleus sampling produce fewer repetitions with a proper set of hyperparameters.
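To make the above concrete, here is a minimal sketch of temperature, top-k and nucleus filtering applied to a single vector of next-token logits (function name and hyperparameter values are illustrative, and the random logits stand in for a real LM output):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sample one token id from next-token logits of shape [vocab_size]."""
    logits = logits / temperature
    if top_k > 0:
        # Top-k: mask out everything below the k-th largest logit.
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if top_p < 1.0:
        # Nucleus: keep the smallest set of tokens whose cumulative probability exceeds top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift so the first token above the threshold is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage with random logits standing in for a real model's prediction.
next_id = sample_next_token(torch.randn(50257), temperature=0.8, top_k=50, top_p=0.95)
```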
Penalized sampling (Keskar et al. 2019): To avoid the common failure case of generating duplicate substrings, the CTRL paper proposed a new sampling method to penalize repetitions by discounting the scores of previously generated tokens. The probability distribution for the next token with repetition penalty is defined as:

$$ p_i = \frac{\exp\big(o_i / (T \cdot I(i \in g))\big)}{\sum_j \exp\big(o_j / (T \cdot I(j \in g))\big)} \quad\quad I(c) = \theta \text{ if the condition } c \text{ is true, else } 1 $$

where $g$ is the set of previously generated tokens and $\theta$ controls the penalty strength. They found $\theta = 1.2$ to yield a good balance between suppressing repetition and truthful generation.
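A small sketch of this repetition penalty applied to raw logits (not the CTRL implementation; following common practice, positive logits are divided by $\theta$ and negative logits multiplied by it so that the penalty always pushes probability down):

```python
import torch

def penalized_softmax(logits, generated_ids, temperature=1.0, theta=1.2):
    """Discount the scores of previously generated tokens before the softmax."""
    penalized = logits.clone()
    if generated_ids:
        prev = torch.tensor(sorted(set(generated_ids)), dtype=torch.long)
        vals = penalized[prev]
        # Divide positive logits / multiply negative logits by theta (> 1).
        penalized[prev] = torch.where(vals > 0, vals / theta, vals * theta)
    return torch.softmax(penalized / temperature, dim=-1)

probs = penalized_softmax(torch.randn(50257), generated_ids=[42, 7, 42])
next_id = torch.multinomial(probs, num_samples=1).item()
```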
Guided Decoding
All the above standard decoding strategies sample tokens according to the predicted probability, with no additional information. Our preferences on topic or sentiment can be baked into the candidate ranking function to guide the sample generation by altering the candidate ranking score. The ranking score for token selection at each decoding step can be set as a combination of LM log-likelihood and a set of desired feature discriminators. The features are designed to quantify human preferences by heuristics (Ghazvininejad et al., 2017), supervised learning (Holtzman et al., 2018) or RL (Li et al., 2017).
Ghazvininejad et al. (2017) built a system called “Hafez” for generating poetry in a desired style by adjusting the sampling weights in beam search at decoding time. The likelihood of sampling the next token $x_{t+1}$ at step $t$ is augmented by a weighted sum of feature functions:

$$ \text{score}(x_{t+1}, b_t) = \text{score}(b_t) + \log p(x_{t+1} \mid x_{\leq t}) + \sum_i \alpha_i f_i(x_{t+1}) $$

where $\log p(x_{t+1} \mid x_{\leq t})$ is the log-likelihood predicted by the LM, $\text{score}(b_t)$ is the accumulated score of the current beam state $b_t$, and each feature function $f_i(\cdot)$ with weight $\alpha_i$ encodes a preference, for example:

- whether $x_{t+1}$ exists in a bag of desired or banned topical words;
- whether $x_{t+1}$ indicates certain sentiments;
- whether $x_{t+1}$ is a repeated token (and thus $f_i$ needs to take the history as input too);
- the length of $x_{t+1}$, if longer or shorter words are particularly preferred.
Similar to Hafez, Baheti et al. (2018) manually designed features for ranking and altered the sampling distribution by appending similarity scores between topic distribution or embeddings of the context and the completion.
Holtzman et al. (2018) adopted a set of learned discriminators, each specializing in a different principle of communication guided by Grice’s maxims: quality, quantity, relation and manner. The discriminators learn to encode these desired principles by measuring repetition, entailment, relevance, and lexical diversity, respectively. Given some ground truth completion, all the discriminator models are trained to minimize a ranking log-likelihood so that the gold continuation is scored higher than generated alternatives; at decoding time, the candidate ranking score combines the LM log-likelihood with the weighted discriminator scores.
Meister et al. (2020) studied beam search in a regularized decoding framework:

$$ \mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathcal{Y}} \Big( \log p_\theta(\mathbf{y} \mid \mathbf{x}) - \lambda \mathcal{R}(\mathbf{y}) \Big) $$

Since we expect maximum probability to have minimum surprise, the surprisal of a LM at time step $t$ is defined as:

$$ u_t(y_t) = -\log p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) $$
The MAP (maximum a posteriori) part demands for sequences with maximum probability given context, while the regularizer introduces other constraints. It is possible a global optimal strategy may need to have a high-surprisal step occasionally so that it can shorten the output length or produce more low-surprisal steps afterwards.
Beam search has stood the test of time in the field of NLP. The question is: if we want to model beam search as exact search in a regularized decoding framework, how should the regularizer $\mathcal{R}(\mathbf{y})$ be defined? The paper connects beam search to the uniform information density (UID) hypothesis.
“The uniform information density hypothesis (UID; Levy and Jaeger, 2007) states that—subject to the constraints of the grammar—humans prefer sentences that distribute information (in the sense of information theory) equally across the linguistic signal, e.g., a sentence.”
In other words, it hypothesizes that humans prefer text with evenly distributed surprisal. Popular decoding methods like top-k sampling or nucleus sampling actually filter out high-surprisal options, thus implicitly encouraging the UID property in output sequences.
The paper experimented with several forms of regularizers:
- Greedy: $\mathcal{R}_\text{greedy}(\mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} \big( u_t(y_t) - \min_{y' \in \mathcal{V}} u_t(y') \big)^2$; if we set $\lambda \to \infty$, we recover greedy search. Note that being greedy at each individual step does not guarantee global optimality.
- Variance regularizer: $\mathcal{R}_\text{var}(\mathbf{y}) = \frac{1}{|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} (u_t - \bar{u})^2$, where $\bar{u}$ is the average surprisal over all timesteps. It directly encodes the UID hypothesis.
- Local consistency: $\mathcal{R}_\text{local}(\mathbf{y}) = \frac{1}{|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} (u_t - u_{t-1})^2$; this decoding regularizer encourages adjacent tokens to have similar surprisal.
- Max regularizer: $\mathcal{R}_\text{max}(\mathbf{y}) = \max_t u_t$ penalizes the maximum surprisal in the sequence.
- Squared regularizer: $\mathcal{R}_\text{square}(\mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} u_t^2$ encourages all the tokens to have surprisal close to 0.
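These regularizers are easy to compute from the per-step surprisals of a candidate sequence. Below is a small sketch (with random tensors in place of a real model) that scores a candidate with $\log p_\theta(\mathbf{y} \mid \mathbf{x}) - \lambda \mathcal{R}(\mathbf{y})$ under each regularizer; names and values are illustrative:

```python
import torch
import torch.nn.functional as F

def uid_scores(logits, target_ids, lam=1.0):
    """logits: [T, V] next-token logits aligned with target_ids: [T]."""
    log_probs = F.log_softmax(logits, dim=-1)
    u = -log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)  # surprisal of chosen tokens
    u_min = -log_probs.max(dim=-1).values                         # lowest achievable surprisal per step
    regs = {
        "greedy":   ((u - u_min) ** 2).sum(),
        "variance": ((u - u.mean()) ** 2).mean(),
        "local":    ((u[1:] - u[:-1]) ** 2).mean(),
        "max":      u.max(),
        "square":   (u ** 2).sum(),
    }
    log_likelihood = -u.sum()
    return {name: (log_likelihood - lam * r).item() for name, r in regs.items()}

scores = uid_scores(torch.randn(12, 50257), torch.randint(0, 50257, (12,)), lam=0.5)
```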
An experiment with the greedy regularizer showed that the regularization strength $\lambda$ matters: with a larger $\lambda$, the solution behaves more like greedy search and the surprisal stays more uniform across the sequence.

A default beam search suffers from decreased text generation quality when the beam size increases. Regularized beam search greatly alleviates this issue, and combining regularizers further improves the performance; in their NMT experiments, the regularized objective maintained high BLEU even with large beam sizes.

Guided decoding essentially runs a more expensive beam search where the sampling probability distribution is altered by side information about human preferences.
Trainable Decoding
Given a trained language model, Gu et al. (2017) proposed a trainable greedy decoding algorithm to maximize an arbitrary objective for sampling sequences. The idea is based on noisy, parallel approximate decoding (NPAD). NPAD injects unstructured noise into the model hidden states and runs noisy decoding multiple times in parallel to avoid potential degradation. Taking this a step further, trainable greedy decoding replaces the unstructured noise with a learnable random variable, predicted by an RL agent that takes the previous hidden state, the previous decoded token and the context as input. In other words, the decoding algorithm learns an RL actor to manipulate the model hidden states for better outcomes.
Grover et al. (2019) trained a binary classifier to distinguish samples from data distribution and samples from the generative model. This classifier is used to estimate importance weights for constructing a new unnormalized distribution. The proposed strategy is called likelihood-free importance weighting (LFIW).
Let $p$ denote the true data distribution and $p_\theta$ denote the learned generative model. A sample from the model can be corrected toward the data distribution by the importance weight

$$ w(x) = \frac{p(x)}{p_\theta(x)} $$

However, $p$ is unknown, so $w(x)$ cannot be computed directly. Then if a binary classifier $c_\phi(x)$ is trained to predict the probability that a sample $x$ comes from the true data distribution rather than from $p_\theta$ (assuming both classes are equally likely a priori), the importance weight can be recovered as

$$ w(x) = \frac{p(x)}{p_\theta(x)} = \frac{c_\phi(x)}{1 - c_\phi(x)} $$

where $c_\phi(x) \to 1$ indicates a sample that looks real and $c_\phi(x) \to 0$ indicates an obvious model sample. Since we cannot learn a perfect optimal classifier, the importance weight is only an estimate $\hat{w}(x)$. A few practical tricks help reduce its bias and variance:

- Self-normalization: normalize each weight by the sum over samples, $\hat{w}(x) / \sum_{x'} \hat{w}(x')$.
- Flattening: add a power scaling parameter $\alpha \geq 0$, $\hat{w}(x)^\alpha$.
- Clipping: specify a lower bound $\beta$, $\max(\hat{w}(x), \beta)$.
To sample from the importance-resampled generative model, $x \sim p_{\theta,\phi}(x) \propto \hat{w}(x)\, p_\theta(x)$, they adopt sampling-importance-resampling (SIR): first draw a batch of samples from $p_\theta$, then resample from the batch with probabilities proportional to the estimated importance weights.
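A sketch of the weighting and resampling procedure, assuming we already have a batch of generated samples and a trained classifier’s probabilities that each sample is real (all names are placeholders):

```python
import numpy as np

def importance_weights(clf_probs, alpha=1.0, beta=1e-3):
    """w(x) = c(x) / (1 - c(x)), followed by the clipping and flattening tricks."""
    w = clf_probs / (1.0 - clf_probs)
    return np.maximum(w, beta) ** alpha

def sir_resample(samples, weights, n, seed=0):
    """Sampling-importance-resampling with self-normalized weights."""
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()
    idx = rng.choice(len(samples), size=n, replace=True, p=p)
    return [samples[i] for i in idx]

samples = ["text_0", "text_1", "text_2", "text_3", "text_4"]   # generated by p_theta
clf_probs = np.array([0.2, 0.9, 0.5, 0.7, 0.1])                # classifier's P(real | x)
picked = sir_resample(samples, importance_weights(clf_probs, alpha=0.8), n=3)
```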

Deng et al. (2020) proposed to learn an EBM to steer a LM in the residual space, $P_\theta(x) \propto P_\text{LM}(x) \exp(-E_\theta(x))$, where $P_\text{LM}$ is the base LM kept fixed.
The goal is to learn the parameters of the energy function $E_\theta$ such that the joint model $P_\theta$ gets closer to the desired data distribution.
However, the partition function is intractable in practice. The paper proposed a simple way to first sample from the original LM and then to resample from them according to the energy function. This is unfortunately quite expensive.

Smart Prompt Design
Large language models have been shown to be very powerful on many NLP tasks, even with only prompting and no task-specific fine-tuning (GPT-2, GPT-3). The prompt design has a big impact on the performance on downstream tasks and often requires time-consuming manual crafting. For example, factual questions can gain a big boost with smart prompt design in the “closed-book exam” setting (Shin et al., 2020; Jiang et al., 2020). I’m expecting to see an increasing amount of literature on automatic smart prompt design.
Gradient-based Search
AutoPrompt (Shin et al., 2020; code) is a method to automatically create prompts for various tasks via gradient-based search. AutoPrompt constructs a prompt by combining the original task inputs $x$ with a collection of trigger tokens $x_\text{trig}$ according to a fixed template; the trigger tokens are shared across all inputs.

The universal trigger tokens are identified using the same gradient-guided search strategy as in Wallace et al., 2019. The universal setting means that the trigger tokens $x_\text{trig}$ should work for all inputs from a dataset and steer the model toward the desired prediction.
The search operates in the embedding space. The embedding of every trigger token $\mathbf{e}_{\text{trig}_i}$ is first initialized to some default value and then gets updated by picking the replacement token that minimizes the first-order Taylor approximation of the task loss around the current embedding:

$$ \mathbf{e}^{(t+1)}_{\text{trig}_i} = \arg\min_{\mathbf{e} \in \mathcal{V}} \big[ \mathbf{e} - \mathbf{e}^{(t)}_{\text{trig}_i} \big]^\top \nabla_{\mathbf{e}^{(t)}_{\text{trig}_i}} \mathcal{L} $$

where $\mathcal{V}$ refers to the embedding matrix of all the tokens and $\nabla_{\mathbf{e}^{(t)}_{\text{trig}_i}} \mathcal{L}$ is the average gradient of the task loss over a batch at iteration $t$. The optimal $\mathbf{e}$ can be found by brute force with a single dot product against the whole embedding matrix, which is cheap and parallelizable.

The above token replacement method can be augmented with beam search. When looking for the optimal token embedding $\mathbf{e}$, we can pick the top-$k$ candidates instead of a single one, search from left to right over the trigger positions, and score each beam by the loss on the current data batch.
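A toy sketch of the first-order candidate scoring at the heart of this search: given the gradient of the task loss with respect to one trigger token’s embedding, every vocabulary embedding can be scored with a single matrix–vector product (the tensors below are random stand-ins for a real model’s embedding table and gradient):

```python
import torch

def topk_replacement_candidates(embedding_matrix, e_trig, grad, k=10):
    """Score each vocabulary token w by the first-order loss change [e_w - e_trig]^T grad
    and return the k candidates with the largest predicted loss decrease."""
    scores = (embedding_matrix - e_trig) @ grad   # [V]
    return torch.topk(-scores, k).indices         # most negative = biggest decrease

V, d = 1000, 64
E = torch.randn(V, d)      # input embedding matrix of the model
e_trig = torch.randn(d)    # current embedding of one trigger token
grad = torch.randn(d)      # dLoss/d(e_trig), averaged over a batch
candidates = topk_replacement_candidates(E, e_trig, grad, k=5)
```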

Smart prompt design essentially produces efficient context that can lead to desired completion. Motivated by this observation, Li & Liang (2021) proposed Prefix-Tuning, which assigns a small number of trainable parameters at the beginning of an input sequence (named the “prefix”) to steer a LM: the model input becomes $[\text{PREFIX}; x; y]$, where the activations at the prefix positions in every layer come from a trainable matrix $P_\theta$ rather than from real tokens.
Note that only the prefix parameters $P_\theta$ are trainable, while the pretrained LM parameters stay frozen during training.

The prefix parameters do not tie to any embeddings associated with real words and thus they are more expressive for steering the context. Directly optimizing $P_\theta$ turned out to be unstable and to hurt performance, so the paper reparameterizes it as a smaller matrix passed through a large feedforward network, $P_\theta[i,:] = \text{MLP}(P'_\theta[i,:])$; only $P_\theta$ is kept after training.
The performance increases with the prefix length up to some threshold (around 10 for table-to-text and 200 for summarization in their experiments).

A few other interesting learnings from their ablation studies include:
- Tuning only the embedding layer (without prefix) is not sufficiently expressive.
- Placing the trainable parameters between $x$ and $y$ (“infix-tuning”) slightly underperforms prefix-tuning, likely because it only affects the context for $y$ while the prefix affects the context for both $x$ and $y$.
- Random initialization of the prefix leads to low performance with high variance. In contrast, initializing the prefix with activations of real words improves generation, even if the words are irrelevant to the task.
Fine-tuned models achieve better task performance but they can fail in the low-data regime. Both AutoPrompt and Prefix-Tuning were found to outperform fine-tuning in the regime where the training dataset is small (i.e. only a few hundred training examples).
Two successive works, P-tuning (Liu et al. 2021; code) and Prompt Tuning (Lester et al. 2021), follow a similar idea of explicitly training continuous prompt embeddings, but with different choices of trainable parameters and architecture. Different from Prefix-Tuning, which inserts continuous prompt vectors into every hidden layer of the transformer, both P-tuning and Prompt Tuning non-invasively add continuous prompts only at the input.
Let $[P_i]$ denote the $i$-th pseudo prompt token in a template $T = \{[P_{0:i}], \mathbf{x}, [P_{i+1:m}], \mathbf{y}\}$, which interleaves trainable prompt tokens with the discrete context $\mathbf{x}$ and target $\mathbf{y}$. In P-tuning, the embeddings of the pseudo tokens are continuous trainable vectors $h_i$ instead of entries of the pretrained embedding table, while the embeddings of $\mathbf{x}$ and $\mathbf{y}$ stay as usual.

There are two major optimization challenges in P-tuning:
- Discreteness: The word embeddings of a pretrained language model are highly discrete. It is hard to optimize the prompt embeddings $h_i$ if they are initialized at random.
- Association: The prompt embeddings $h_i$ should be dependent on each other. Thus they model this dependency by training a lightweight LSTM-based prompt encoder (a bidirectional LSTM followed by a small MLP) that outputs the prompt embeddings.
P-tuning is more flexible than prefix-tuning, as it inserts trainable tokens in the middle of a prompt not just at the beginning. The usage of task-specific anchor tokens is like combining manual prompt engineering with trainable prompts.
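A minimal sketch of such a prompt encoder (dimensions and module layout are illustrative; the produced vectors would be injected at the pseudo-token positions of the template):

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Map m trainable pseudo-token inputs to prompt embeddings via a
    bidirectional LSTM followed by a small MLP."""
    def __init__(self, num_prompt_tokens=8, hidden=256, embed_dim=768):
        super().__init__()
        self.inputs = nn.Parameter(torch.randn(num_prompt_tokens, hidden))
        self.lstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, embed_dim)
        )

    def forward(self):
        out, _ = self.lstm(self.inputs.unsqueeze(0))  # [1, m, 2 * hidden]
        return self.mlp(out).squeeze(0)               # [m, embed_dim]

encoder = PromptEncoder()
prompt_embeds = encoder()  # trained jointly with the task loss; the LM stays frozen
```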
Prompt Tuning (Lester et al. 2021) largely simplifies the idea of prefix tuning by only allowing an additional $k$ tunable tokens per downstream task to be prepended to the input text, while the entire pretrained model stays frozen.
- Prompt tuning produces results competitive with model fine-tuning when the model gets large (billions of parameters and up). This result is especially interesting given that large models are expensive to fine-tune and execute at inference time.
- With learned task-specific parameters, prompt tuning achieves better transfer learning when adapting to new domains. It outperforms fine-tuning on domain shift problems.
- They also showed that prompt ensembling of multiple prompts for the same task introduces further improvement.

The experiments investigated several prompt initialization schemes:
- Random initialization by uniformly sampling from [-0.5, 0.5];
- Sample embeddings of top 5000 common tokens;
- Use the embedding values of the class label strings; if there are not enough class labels to fill the soft prompt, fall back to scheme 2.

Random initialization performs noticeably worse than the other two options.
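Below is a hedged sketch of the overall recipe with a causal LM from Hugging Face Transformers (the paper itself uses T5; the GPT-2 variant, the prompt length and the initialization from the first few vocabulary entries are all illustrative choices):

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
for p in model.parameters():           # freeze the entire pretrained LM
    p.requires_grad = False

wte = model.get_input_embeddings()     # token embedding table
n_prompt = 20
init_ids = torch.arange(n_prompt)      # stand-in for sampling common-token ids (scheme 2)
soft_prompt = nn.Parameter(wte(init_ids).detach().clone())   # the only trainable parameters

def forward_with_prompt(input_ids):
    tok_embeds = wte(input_ids)                                      # [B, T, d]
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)           # [B, n_prompt + T, d]
    labels = torch.cat(
        [torch.full((input_ids.size(0), n_prompt), -100, dtype=torch.long), input_ids],
        dim=1,
    )                                                                # no loss on prompt positions
    return model(inputs_embeds=inputs_embeds, labels=labels)

batch = tokenizer(["Translate to French: cheese"], return_tensors="pt")
loss = forward_with_prompt(batch["input_ids"]).loss
loss.backward()   # gradients flow only into soft_prompt
```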

The pre-training objectives also have a big impact on the quality of prompt tuning. T5’s “span corruption” is not a good option here.
Prompt tuning was also found to be less likely to overfit a specific dataset. To evaluate robustness to data shift, they trained the model on one dataset of a task and evaluated it on a test dataset from a different domain. Prompt tuning is more resilient and generalizes to different domains better.

Heuristic-based Search
Paraphrasing is a quick way to explore more prompts similar to a known version, which can be done via back-translation. With back-translation, the initial prompt is translated into $B$ candidates in another language and then each candidate is translated back into $B$ candidates in the original language; the resulting paraphrases are ranked and filtered by their round-trip probabilities.
Ribeiro et al. (2018) identified semantically equivalent adversaries (SEA): paraphrases $x'$ of an input $x$ that are judged semantically equivalent to $x$ (based on paraphrase scores from back-translation models) yet trigger a different prediction from the target function, $f(x) \neq f(x')$. Paraphrase patterns that produce such adversaries across many examples are generalized into semantically equivalent adversarial rules (SEARs).
Examples of SEA rules include (What NOUN → Which NOUN), (WP is → WP ’s), (was → is), etc. They are considered as “bugs” in the model. Applying those rules as data augmentation in model training helps robustify the model and fix such bugs.
Jiang et al. (2020) attempted to validate whether a trained language model knows certain knowledge by automatically discovering better prompts to query it with. Within the scope of knowledge retrieval, where factual knowledge is represented as a triple $\langle \text{subject}, \text{relation}, \text{object} \rangle$, they generate candidate prompts per relation by mining templates from a large corpus and by back-translation paraphrasing, and then select or ensemble the best-performing ones.
Interestingly, some small modifications of a prompt may lead to a big gain, as shown in Fig. X.

Fine-tuning
Fine-tuning is an intuitive way to guide a LM to output desired content, commonly by training on supervised datasets or by RL. We can fine-tune all the weights in the model or restrict the fine-tuning to only top or additional layers.
Conditional Training
Conditional training aims to learn a generative model conditioned on a control variable $z$, $p(y \mid x, z)$.
Fan et al. (2018) trained a conditional language model for 2-step story generation. First, a model outputs a story sketch and then a story-writing model creates a story following that sketch. The conditioning on the sketch is implemented by a fusion model architecture, which enforces a form of residual learning that allows the story-writing model to focus on learning what the first sketch-generation model is missing. Also for story generation, Peng et al. (2018) experimented with an ending-valence-conditioned story generator LM, $p(x_t \mid x_{<t}, z)$, where the control variable $z$ indicates the desired valence of the ending (e.g. happy or sad).
CTRL (Keskar et al., 2019; code) trains a language model conditioned on a control code $z$, $p(x \mid z)$, where the control code indicates the domain, style or topic of the text, e.g. [horror], [legal], etc. The learned model is then able to generate text conditioned on a prompt prefixed with the control code. The training data contains Wikipedia, OpenWebText, books, Amazon reviews, the Reddit corpus and many more, where each dataset is assigned a control code and each subreddit in the Reddit corpus has its own topic as control code.
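Data preparation for this kind of conditional training is straightforward: prepend the control code to every training sequence, so the model learns $p(x \mid z)$ and can be steered at inference time by starting the prompt with a code. A small sketch (the control codes and text are made up):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
control_codes = ["[horror]", "[legal]", "[reviews]"]
# Register the codes as special tokens so each maps to a single id;
# the model's embedding matrix must then be resized accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": control_codes})

def make_example(control_code, text):
    return tokenizer(control_code + " " + text, return_tensors="pt")["input_ids"]

ids = make_example("[horror]", "The house at the end of the street had been empty for years.")
# After adding tokens: model.resize_token_embeddings(len(tokenizer))
```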

The control code can also be used for domain annotation of given tokens, because by Bayes’ rule $p(z \mid x) \propto p(x \mid z)\, p(z)$.

Note that CTRL trains a transformer model from scratch. However, labelling all the text within the same dataset with the same control code (e.g. all Wikipedia articles get “wikipedia” as the control code) feels quite constrained. Considering that we often need highly customized control codes but only have a limited amount of labelled data, I would expect fine-tuning an unconditional LM on a small labelled dataset in the same way as CTRL to work reasonably well too, although how much data is needed and how good the sample quality would be are subject to experimentation.
RL Fine-tuning
Fine-tuning a sequential model with RL with respect to any arbitrary, possibly non-differentiable, reward function was proven to work well years ago (Ranzato et al., 2015). RL fine-tuning can resolve several problems with the teacher forcing method. With teacher forcing, the model only minimizes a maximum-likelihood loss at each individual decoding step during training but is asked to predict the entire sequence from scratch at test time. Such a discrepancy between training and test can lead to exposure bias and accumulated errors. In contrast, RL fine-tuning is able to directly optimize task-specific metrics at the sequence level, such as BLEU for translation (Ranzato et al., 2015, Wu et al., 2016, Nguyen et al., 2017), ROUGE for summarization (Ranzato et al., 2015, Paulus et al., 2017, Wu and Hu, 2018) and customized metrics for story generation (Tambwekar et al., 2018).
Ranzato et al (2015) applied REINFORCE to train RNN models for sequence generation tasks. The model is first trained to predict the next token using cross-entropy loss (ML loss) and then fine-tuned alternatively by both ML loss and REINFORCE (RL loss). At the second fine-tuning stage, the number of training steps for next-token prediction is gradually decreasing until none and eventually only RL loss is used. This sequence-level RL fine-tuning was shown by experiments to lead to great improvements over several supervised learning baselines back then.
Google implemented a similar approach in their neural machine translation system (Wu et al., 2016) and Paulus et al. (2017) adopted such an approach for summarization. The training objective contains two parts: the ML loss for next-token prediction and an RL loss for a sequence-level reward.
The RL loss of Google NMT maximizes the expected sequence-level reward of sampled translations:

$$ \mathcal{L}_\text{RL} = - \sum_{(x, y^*) \in \mathcal{D}} \mathbb{E}_{y \sim p_\theta(\cdot \mid x)} \big[ R(y, y^*) \big] $$

where $y$ is a sampled translation, $y^*$ is the ground truth and $R$ is a sentence-level BLEU-style score. The final training objective is a linear combination of the ML and RL losses.
Paulus et al. (2017) added an extra weighting term based on the reward difference between two output sequences: one $y^s$ obtained by sampling and one $\hat{y}$ obtained by greedy decoding, which acts as a self-critical baseline:

$$ \mathcal{L}_\text{RL} = \big( R(\hat{y}) - R(y^s) \big) \sum_t \log p_\theta(y^s_t \mid y^s_{<t}, x) $$

Minimizing this loss increases the likelihood of sampled sequences that obtain a higher reward than the greedy baseline.
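A small sketch of this self-critical loss, given the per-token log-probabilities of the sampled sequence and the scalar rewards of the sampled and greedy outputs (the values are made up):

```python
import torch

def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """(R(y_hat) - R(y_s)) * sum_t log p(y_s_t | y_s_<t, x); minimizing it raises the
    likelihood of sampled sequences that beat the greedy baseline."""
    return (greedy_reward - sample_reward) * sample_log_probs.sum()

log_probs = torch.log(torch.rand(6)).requires_grad_()   # stand-in for the policy's log-probs
loss = self_critical_loss(log_probs, sample_reward=0.42, greedy_reward=0.37)
loss.backward()   # would propagate into the LM that produced log_probs
```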
RL Fine-tuning with Human Preferences
Reward learning is critical for defining human preferences. Quantitative measurement like BLEU or ROUGE computes the overlap of words and n-gram phrases between sequences and does not always correlate with better quality by human judges. Reward learning from human feedback (Christiano et al., 2017) is a better way to align what we measure with what we actually care about. Human feedback has been applied to learn a reward function for applications like story generation (Yi et al., 2019) and summarization (Böhm et al., 2019, Ziegler et al., 2019, Stiennon et al., 2020).
In order to generate more coherent conversations, Yi et al. (2019) collected 4 types of binary human feedback given a conversation pair (user utterance, system response): whether the system response is (1) comprehensible, (2) on topic, (3) interesting and (4) leading to continuation of the conversation. An evaluator is trained to predict the human feedback and then used to rerank beam search samples, to finetune the model, or to do both. (Actually they didn’t use RL fine-tuning, but rather used the evaluator to provide a discriminator loss in supervised fine-tuning.)
Let’s define a learned reward function $R_\psi(x, y)$ as a prediction of the quality of output $y$ given input $x$, parameterized by $\psi$.
To learn the ground-truth reward $R^*$ defined by human judgements, Böhm et al. (2019) compared two loss functions:
(1) Regression loss: simply minimizing the mean squared error between $R_\psi$ and $R^*$.
(2) Preference loss: learning to agree with the ordering given by the ground-truth reward,

$$ \mathcal{L}^\text{pref}_\psi = - \sum_{(x, y_1, y_2)} \log \frac{\exp\big(R_\psi(x, y_1)\big)}{\exp\big(R_\psi(x, y_1)\big) + \exp\big(R_\psi(x, y_2)\big)} \quad \text{where } R^*(x, y_1) > R^*(x, y_2) $$
Their experiments showed that the preference loss achieves the best performance, where the reward model is a thin MLP layer on top of BERT sentence embedding.
Ziegler et al. (2019) collected human labels by asking humans to select the best candidate $y_b$ out of 4 options $\{y_i\}$ sampled from the model for a given input $x \sim \mathcal{D}$.

The reward model is implemented by a pretrained language model with an extra random linear layer on top of the final embedding output. It is trained to minimize the loss:

$$ \mathcal{L}_r = - \mathbb{E}_{(x, \{y_i\}, b) \sim \mathcal{D}} \Big[ \log \frac{\exp\big(r(x, y_b)\big)}{\sum_i \exp\big(r(x, y_i)\big)} \Big] $$
To keep the scale consistent during training, the reward model is normalized to have mean 0 and variance 1.
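In code, this reward-model loss is just a 4-way cross entropy over the candidates’ scalar rewards (a sketch with made-up shapes):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(candidate_rewards, best_idx):
    """candidate_rewards: [batch, 4] rewards r(x, y_i); best_idx: [batch] human choice b."""
    return F.cross_entropy(candidate_rewards, best_idx)

rewards = torch.randn(8, 4, requires_grad=True)   # would come from the reward model head
best = torch.randint(0, 4, (8,))                  # human labels
loss = reward_model_loss(rewards, best)
loss.backward()
```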
During RL fine-tuning, the policy $\pi$, initialized from the pretrained LM $p$, is optimized with PPO using the learned reward model. To avoid the policy drifting too far away from its original behavior, a KL penalty is added:

$$ R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{p(y \mid x)} $$
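A sketch of this KL-penalized reward, computed from per-token log-probabilities under the current policy and the frozen original LM ($\beta$ and the numbers are illustrative):

```python
import torch

def kl_penalized_reward(reward, policy_log_probs, ref_log_probs, beta=0.02):
    """R(x, y) = r(x, y) - beta * sum_t [log pi(y_t|..) - log p(y_t|..)]."""
    log_ratio = (policy_log_probs - ref_log_probs).sum()
    return reward - beta * log_ratio

R = kl_penalized_reward(
    reward=0.8,
    policy_log_probs=torch.tensor([-2.1, -0.7, -1.3]),
    ref_log_probs=torch.tensor([-2.0, -1.1, -1.5]),
)
```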
If online data collection is used, the human labelling process continues during RL fine-tuning, so the human labelers can review results generated by the latest policy. The number of human labels is spread out evenly over the training process, and the reward model is retrained periodically. Online data collection turned out to be important for the summarization task but not for the text continuation task. In their experiments, jointly training the reward model and the policy with shared parameters did not work well and could lead to overfitting due to the big imbalance between dataset sizes.
In the follow-up work (Stiennon et al., 2020), the human label collection was further simplified to selecting the best option $y_b$ between a pair of summaries $(y_0, y_1)$, and the reward model was trained to minimize

$$ \mathcal{L}_r = - \mathbb{E}_{(x, y_0, y_1, b) \sim \mathcal{D}} \big[ \log \sigma\big( r(x, y_b) - r(x, y_{1-b}) \big) \big] $$

Guided Fine-tuning with Steerable Layer
Instead of fine-tuning the entire model, only fine-tuning a small extra set of parameters while the base model stays fixed is computationally cheaper.
In computer vision, plug-and-play generative networks (PPGN; Nguyen et al., 2017) generate images with different attributes by plugging a discriminator (attribute model) $p(a \mid x)$ into a base generative model $p(x)$. Inspired by PPGN, the plug-and-play language model (PPLM; Dathathri et al., 2020) combines one or multiple simple attribute models with a pretrained language model for controllable text generation.
Given an attribute $a$ and the generated text $x$ so far, the attribute model computes $p(a \mid x)$. To control generation, at each step PPLM shifts the LM hidden states in two directions:
- One toward higher log-likelihood of the attribute $a$ under $p(a \mid x)$, so that the output content acquires the desired attribute.
- The other toward higher log-likelihood of the unmodified language model $p(x)$, so that the generated text remains fluent and smooth natural language.
To shift the output, at decoding time, PPLM runs one forward → one backward → one forward, three passes in total:
- First, a forward pass is performed to compute the likelihood of the attribute $a$, $p(a \mid x)$, given the current hidden states $H_t$.
- Let $\Delta H_t$ be a stepwise update to the hidden states such that $(H_t + \Delta H_t)$ shifts the distribution of generated text closer to having the attribute $a$; $\Delta H_t$ is initialized at zero. A backward pass then updates $\Delta H_t$ using the normalized gradient from the attribute model,

$$ \Delta H_t \leftarrow \Delta H_t + \alpha \frac{\nabla_{\Delta H_t} \log p(a \mid H_t + \Delta H_t)}{\big\| \nabla_{\Delta H_t} \log p(a \mid H_t + \Delta H_t) \big\|^\gamma} $$

where $\alpha$ is the step size and $\gamma$ is a normalization scaling coefficient; this update can be repeated several times per decoding step.
- Finally, another forward pass recomputes the next-token distribution from the updated hidden states $\tilde{H}_t = H_t + \Delta H_t$ and samples the next token.

Multiple attribute models can be mix-and-matched during generation with customized weights, acting as a set of “control knobs”. The PPLM paper explored two types of attribute models:
1. The simplest attribute model is based on a predefined bag of words (BoW), $\{w_1, \dots, w_k\}$, that specifies a topic of interest. The attribute likelihood is the total probability mass the LM assigns to the bag, $\log p(a \mid x) = \log \big( \sum_{i=1}^k p_{t+1}[w_i] \big)$. To encourage the model to output the desired words at least once but not at every step, they normalize the gradient by the maximum gradient norm. Interestingly, they found that increasing the probability of generating words in the bag also increases the probability of generating related, but not identical, words on the same topic. (A toy sketch of this gradient update follows after this list.)
2. The discriminator attribute models are based on learned classifiers which define preferences by a distribution instead of hard samples.
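Here is a toy, self-contained sketch of the BoW gradient step from item 1: a perturbation $\Delta h$ on a hidden state is pushed toward a higher $\log \sum_i p_{t+1}[w_i]$, with the gradient normalized by its norm. A random linear layer stands in for the LM’s output projection, and the fluency (KL and fusion) terms are omitted:

```python
import torch
import torch.nn.functional as F

d, V = 64, 1000
lm_head = torch.nn.Linear(d, V)        # stand-in for the LM's output projection
h = torch.randn(d)                     # current hidden state from the LM
bow_ids = torch.tensor([3, 17, 256])   # token ids in the topic bag of words

delta = torch.zeros(d, requires_grad=True)
alpha, gamma, num_steps = 0.03, 1.0, 3
for _ in range(num_steps):
    probs = F.softmax(lm_head(h + delta), dim=-1)
    log_p_attr = torch.log(probs[bow_ids].sum())          # log p(a | h + delta) for a BoW attribute
    grad = torch.autograd.grad(log_p_attr, delta)[0]
    with torch.no_grad():
        delta += alpha * grad / (grad.norm() ** gamma + 1e-10)   # normalized ascent step

# The next-token distribution is then recomputed from the shifted hidden state (h + delta).
```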
To ensure the fluency in language, PPLM applied two additional designs:
- Minimizing the KL divergence between the modified and unmodified LM distributions, as is commonly seen in other fine-tuning approaches (see above).
- Performing post-norm fusion to constantly tie the generated text back to the unconditional LM $p(x)$, sampling from $x_{t+1} \sim \frac{1}{\beta}\, \tilde{p}_{t+1}^{\gamma_{gm}}\, p_{t+1}^{1 - \gamma_{gm}}$, where $p_{t+1}$ and $\tilde{p}_{t+1}$ are the unmodified and modified output distributions, respectively, $\beta$ is a normalizing factor and $\gamma_{gm}$ balances between the predictions before and after the hidden-state update.

Interestingly, they found a large variance in the extent of controllability across topics. Some topics (religion, science, politics) are easier to control for compared to others (computers, space).
One obvious drawback of PPLM is that due to multiple passes at every decoding step, the test time computation becomes much more expensive.
Similar to PPLM, DELOREAN (DEcoding for nonmonotonic LOgical REAsoNing; Qin et al., 2020) incorporates the future context by back-propagation. Given input text $x$, the goal is to generate a continuation $y$ that satisfies constraints defined by a future context $z$.
To keep the generation differentiable, a soft representation of the continuation is maintained: $\tilde{y}^{(i)}$ denotes the sequence of next-token logits at iteration $i$. The approach repeats the following procedure:
- Backward: The constraint is represented as a loss function $\mathcal{L}(x, \tilde{y}^{(i)}, z)$, and the logits are updated via gradient descent: $\tilde{y}^{(b)}_n = \tilde{y}^{(i)}_n - \lambda \nabla_{\tilde{y}_n} \mathcal{L}(x, \tilde{y}^{(i)}, z)$.
- Forward: Run a forward pass of the LM to keep the generated text fluent: $\tilde{y}^{(f)}_n = \text{LM}(x, \tilde{y}^{(i)}_{1:n-1})$.
- Then linearly combine the two logits to form the new representation: $\tilde{y}^{(i+1)}_n = \gamma\, \tilde{y}^{(f)}_n + (1 - \gamma)\, \tilde{y}^{(b)}_n$. Note that each $\tilde{y}^{(i+1)}_{1:n-1}$ is needed to compute $\tilde{y}^{(i+1)}_n$.
Side-tuning (Zhang et al., 2019) trains a lightweight side network that learns a residual on top of the original model outputs, without modifying the pretrained model weights. Unlike PPLM, no gradient update is applied to the hidden states. It is a simple yet effective approach for incremental learning. The base model is treated as a black box and does not necessarily have to be a neural network. The side-tuning setup assumes the base and side models receive exactly the same input and the side model is learned independently.

The paper explored different strategies for fusing the predictions of the base model $B(x)$ and the side model $S(x)$: a product combination performed the worst, while a simple weighted sum via $\alpha$-blending, $R(x) = \alpha B(x) + (1 - \alpha) S(x)$, worked well; the learned $\alpha$ is also indicative of how much the target task relies on the base model.
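A minimal sketch of $\alpha$-blending with a frozen base network and a trainable side network (toy MLPs stand in for the real models):

```python
import torch
import torch.nn as nn

class SideTuned(nn.Module):
    """R(x) = alpha * B(x) + (1 - alpha) * S(x); B is frozen, S and alpha are trained."""
    def __init__(self, base, side):
        super().__init__()
        self.base, self.side = base, side
        for p in self.base.parameters():
            p.requires_grad = False
        self.alpha_logit = nn.Parameter(torch.zeros(1))   # alpha = sigmoid(.) stays in (0, 1)

    def forward(self, x):
        alpha = torch.sigmoid(self.alpha_logit)
        with torch.no_grad():
            base_out = self.base(x)
        return alpha * base_out + (1 - alpha) * self.side(x)

base = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
side = nn.Linear(16, 4)
out = SideTuned(base, side)(torch.randn(2, 16))
```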
Auxiliary tuning (Zeldes et al., 2020) supplements the original pretrained model with an auxiliary model that shifts the output distribution according to the target task. The base and auxiliary model outputs are merged at the logits level, and the combined model is trained to maximize the likelihood of the target output, $p(x_t \mid x_{<t}, z)$, where $z$ is the task signal.
The conditional probability $p(x_t \mid x_{<t}, z)$ is decomposed into two parts:
- $p(x_t \mid x_{<t})$ assigns high probabilities to fluent sequences of tokens;
- a shift on $p(x_t \mid x_{<t})$ toward $p(x_t \mid x_{<t}, z)$.
By Bayes’ rule, we have

$$ p(x_t \mid x_{<t}, z) \propto p(z \mid x_{\leq t})\, p(x_t \mid x_{<t}) $$

And therefore the auxiliary model effectively only needs to learn the shift $p(z \mid x_{\leq t})$; in practice its logits are simply added to the frozen LM’s logits before the softmax.

GeDi (Krause et al., 2020) guides text generation with a Generative Discriminator. The discriminator is implemented as a class-conditional language model (CC-LM), $p_\phi(x_{1:t} \mid z)$, and generation is steered by the contrast between two class-conditional distributions:
- One conditioned on the control code $z$ for the desired attribute.
- The other conditioned on the anti-control code $\bar{z}$ for undesired attributes.
GeDi relies on the contrast between $p_\phi(x_{1:t} \mid z)$ and $p_\phi(x_{1:t} \mid \bar{z})$ to compute the probability that the sequence belongs to the desired class, by applying Bayes’ rule:

$$ p(z \mid x_{1:t}) = \frac{p(z)\, p_\phi(x_{1:t} \mid z)}{p(z)\, p_\phi(x_{1:t} \mid z) + p(\bar{z})\, p_\phi(x_{1:t} \mid \bar{z})} $$

where $p(z)$ and $p(\bar{z})$ are class priors (in practice, GeDi also length-normalizes the class-conditional likelihoods). Computing $p(z \mid x_{1:t}, x_{t+1})$ for every next-token candidate $x_{t+1}$ only requires two parallel forward passes of the CC-LM, one per control code, which is far cheaper than feeding every candidate into a standard discriminator.

They fine-tuned a GPT2-medium model with control codes, similar to how CTRL is trained, to form a CC-LM using a linear combination of the discriminative and generative losses. This discriminator model is then used as GeDi to guide generation by a larger language model such as GPT2-XL.
One way of decoding from GeDi is to sample from the weighted posterior $p^w(x_{t+1} \mid x_{1:t}, z) \propto p(x_{t+1} \mid x_{1:t})\, p(z \mid x_{1:t+1})^\omega$, where $\omega > 1$ puts more emphasis on the desired class; the paper additionally filters out next-token candidates with low $p(z \mid x_{1:t+1})$.
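A sketch of one such decoding step, where the base LM’s next-token distribution is reweighted by the per-candidate posterior $p(z \mid x_{1:t+1})$ computed from the CC-LM’s two conditional distributions (all tensors are random stand-ins, $\omega$ is illustrative, and the running sequence-likelihood ratio is folded into the CC-LM probabilities for brevity):

```python
import torch

def gedi_step(base_probs, cc_probs_z, cc_probs_zbar, prior_z=0.5, omega=5.0):
    """base_probs, cc_probs_z, cc_probs_zbar: [V] next-token probabilities from the
    base LM and from the CC-LM under the control code z / anti control code z_bar."""
    post_z = prior_z * cc_probs_z / (prior_z * cc_probs_z + (1 - prior_z) * cc_probs_zbar)
    weighted = base_probs * post_z.pow(omega)     # p(x_{t+1}|x_{1:t}) * p(z|x_{1:t+1})^omega
    return weighted / weighted.sum()

V = 1000
probs = gedi_step(torch.softmax(torch.randn(V), -1),
                  torch.softmax(torch.randn(V), -1),
                  torch.softmax(torch.randn(V), -1))
next_id = torch.multinomial(probs, num_samples=1).item()
```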
GeDi guided generation in their experiments showed strong controllability and ran 30x faster than PPLM.
Distributional Approach
Generation with Distributional Control (GDC; Khalifa, et al. 2020) frames controlled text generation as the optimization of a probability distribution with a constraint. It involves two major steps.
Step 1: Learn an EBM of the target distribution $p$.
Let’s label the pretrained LM as $a(x)$. We want the target distribution $p$ to satisfy a set of feature expectation constraints, $\mathbb{E}_{x \sim p}[\phi_i(x)] = \bar{\mu}_i$ for pre-chosen feature functions $\phi_i$, while staying as close as possible to $a$ in terms of KL divergence.
In summary, given a pretrained model $a$ and the constraint moments $\bar{\boldsymbol{\mu}}$, we aim to find

$$ p = \arg\min_{q \in \mathcal{C}} D_\text{KL}(q \,\|\, a) $$

where $\mathcal{C}$ is the set of distributions satisfying the moment constraints.
According to theorems in Information Geometry, the solution has an exponential-family form and can be represented as an EBM (an unnormalized probability distribution),

$$ P(x) = a(x) \exp\big( \langle \boldsymbol{\lambda}, \boldsymbol{\phi}(x) \rangle \big), \quad p(x) = \frac{P(x)}{Z} $$

Let’s define the importance weight $w(x, \boldsymbol{\lambda}) = \frac{P(x)}{a(x)} = \exp\langle \boldsymbol{\lambda}, \boldsymbol{\phi}(x) \rangle$; the feature moments $\boldsymbol{\mu}(\boldsymbol{\lambda}) = \mathbb{E}_{x \sim p}[\boldsymbol{\phi}(x)]$ can then be estimated with self-normalized importance sampling using samples from $a$.
Using SGD over the objective $\| \bar{\boldsymbol{\mu}} - \boldsymbol{\mu}(\boldsymbol{\lambda}) \|_2^2$, we can learn the parameters $\boldsymbol{\lambda}$ and thus obtain the EBM $P(x)$.
Step 2: Learn the target probability distribution $p$ with an autoregressive policy $\pi_\theta$.
The EBM $P(x)$ can score any sequence, but it cannot be sampled from directly (the partition function $Z$ is intractable) and thus cannot be used for generation as-is.
To learn a $\pi_\theta$ that approximates $p$, the paper uses a KL-adaptive version of the Distributional Policy Gradient (DPG) algorithm, which performs gradient ascent on $\mathbb{E}_{x \sim p}[\log \pi_\theta(x)]$ via importance sampling from a proposal distribution $q$ (initialized to $a$ and periodically updated to the latest $\pi_\theta$ when it gets closer to $p$):

$$ \nabla_\theta \approx \mathbb{E}_{x \sim q} \Big[ \frac{P(x)}{q(x)} \nabla_\theta \log \pi_\theta(x) \Big] $$

This approach can be used to model various constraints in controllable text generation:
- Pointwise constraints: $\phi_i$ is a binary feature, such as constraining the presence or absence of certain words, or a classifier-based constraint on the whole sequence.
- Distributional constraints: $\bar{\mu}_i$ specifies a target proportion over the generated corpus, such as constraining the distribution of genders or topics. Their experiments showed great progress in debiasing a GPT-2 model trained on the Wikipedia Biographies corpus: the percentage of generated biographies about females increased from 7.4% to 35.6%.
- Hybrid constraints: combine multiple constraints by simply summing them up.

Compared to other baselines, GDC using pointwise constraints diverges less from the base model $a$ and achieves a better trade-off between constraint satisfaction and sample quality, although it converges more slowly.

- REINFORCE that optimizes the reward $\phi$ directly (labelled REINFORCE in Fig. X) without a divergence constraint converges fast but deviates a lot from the original model.
- REINFORCE that optimizes $P(x)$ (labelled REINFORCE$_{P(x)}$ in Fig. X) has low sample diversity.
- Compared to Ziegler et al., 2019, GDC has smoother learning curves and produces a richer vocabulary.
Unlikelihood Training
The standard way of maximizing the log-likelihood loss in language model training leads to incorrect token distribution, which cannot be fixed with only smart decoding methods. Such models tend to output high-frequency words too often and low-frequency words too rarely, especially when using deterministic decoding (e.g. greedy, beam search). In other words, they are overconfident in their predictions.
Unlikelihood training (Welleck & Kulikov et al., 2019) tries to combat this by incorporating a preference against unwanted content directly into the training objective. It combines two types of updates:
- a routine maximum-likelihood update that assigns high probability to the ground-truth tokens;
- a new type of unlikelihood update that assigns low probability to unwanted negative candidate tokens.
Given a sequence of tokens $x$ and a set of negative candidate tokens $\mathcal{C}^t = \{c_1, \dots, c_m\}$ at step $t$, where each candidate $c_i$ is a vocabulary token, the combined loss at step $t$ is:

$$ \mathcal{L}^t_\text{UL} = - \alpha \sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big) - \log p_\theta(x_t \mid x_{<t}) $$
One approach for constructing $\mathcal{C}^t$ is to use the tokens that have already appeared in the previous context, excluding the ground-truth token $x_t$ itself; this directly penalizes repeating the context.
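A sketch of the token-level loss at a single step, with negative candidates taken from the previous context (the $\alpha$ value and vocabulary size are illustrative):

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, target, prev_tokens, alpha=1.0):
    """logits: [V] next-token logits at step t; target: ground-truth token id x_t;
    prev_tokens: 1-D tensor of token ids from the context x_<t."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    neg_cands = prev_tokens[prev_tokens != target].unique()          # C^t = context tokens \ {x_t}
    ul = -torch.log1p(-probs[neg_cands].clamp(max=1 - 1e-6)).sum()   # -sum_c log(1 - p(c))
    mle = -log_probs[target]                                         # -log p(x_t)
    return mle + alpha * ul

loss = unlikelihood_loss(torch.randn(50257, requires_grad=True),
                         target=torch.tensor(17),
                         prev_tokens=torch.tensor([4, 17, 92, 4]))
loss.backward()
```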
Unlikelihood training can also be extended to the sequence level, where the negative continuation is defined by a sequence of per-step negative candidate sets designed to penalize properties we don’t like. For example, to penalize repeating n-grams we can set $\mathcal{C}^t = \{x_t\}$ whenever the n-gram ending at position $t$ has already occurred earlier in the sequence, and $\mathcal{C}^t = \emptyset$ otherwise.
Their experiments used unlikelihood training to avoid repetitions in language model outputs and indeed showed better results on less repetition and more unique tokens compared to standard MLE training.
Citation
Cited as:
Weng, Lilian. (Jan 2021). Controllable neural text generation. Lil’Log. https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/.
Or
@article{weng2021conditional,
title = "Controllable Neural Text Generation.",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2021",
month = "Jan",
url = "https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/"
}
References
[1] Patrick von Platen. “How to generate text: using different decoding methods for language generation with Transformers” Hugging face blog, March 18, 2020.
[2] Angela Fan, et al. “Hierarchical Neural Story Generation.” arXiv preprint arXiv:1805.04833 (2018).
[3] Ari Holtzman et al. “The Curious Case of Neural Text Degeneration.” ICLR 2020.
[4] Marjan Ghazvininejad et al. “Hafez: an interactive poetry generation system.” ACL 2017.
[5] Ari Holtzman et al. “Learning to write with cooperative discriminators.” ACL 2018.
[6] Ashutosh Baheti et al. “Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints.” EMNLP 2018.
[7] Jiatao Gu et al. “Trainable greedy decoding for neural machine translation.” EMNLP 2017.
[8] Kyunghyun Cho. “Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model.” arXiv preprint arXiv:1605.03835. (2016).
[9] Marco Tulio Ribeiro et al. “Semantically equivalent adversarial rules for debugging NLP models.” ACL 2018.
[10] Eric Wallace et al. “Universal Adversarial Triggers for Attacking and Analyzing NLP.” EMNLP 2019. [code]
[11] Taylor Shin et al. “AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts.” EMNLP 2020. [code]
[12] Zhengbao Jiang et al. “How Can We Know What Language Models Know?” TACL 2020.
[13] Nanyun Peng et al. “Towards Controllable Story Generation.” NAACL 2018.
[14] Nitish Shirish Keskar, et al. “CTRL: A Conditional Transformer Language Model for Controllable Generation” arXiv preprint arXiv:1909.05858 (2019).[code]
[15] Marc’Aurelio Ranzato et al. “Sequence Level Training with Recurrent Neural Networks.” ICLR 2016.
[16] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” CoRR 2016.
[17] Romain Paulus et al. “A Deep Reinforced Model for Abstractive Summarization.” ICLR 2018.
[18] Paul Christiano et al. “Deep Reinforcement Learning from Human Preferences.” NIPS 2017.
[19] Sanghyun Yi et al. “Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators.” INLG 2019.
[20] Florian Böhm et al. “Better rewards yield better summaries: Learning to summarise without references.” EMNLP 2019. [code]
[21] Daniel M Ziegler et al. “Fine-tuning language models from human preferences.” arXiv preprint arXiv:1909.08593 (2019). [code]
[22] Nisan Stiennon, et al. “Learning to summarize from human feedback.” arXiv preprint arXiv:2009.01325 (2020).
[23] Sumanth Dathathri et al. “Plug and play language models: a simple approach to controlled text generation.” ICLR 2020. [code]
[24] Jeffrey O Zhang et al. “Side-tuning: Network adaptation via additive side networks” ECCV 2020.
[25] Ben Krause et al. “GeDi: Generative Discriminator Guided Sequence Generation.” arXiv preprint arXiv:2009.06367 (2020).
[26] Yoel Zeldes et al. “Technical Report: Auxiliary Tuning and its Application to Conditional Text Generation.” arXiv preprint arXiv:2006.16823 (2020).
[27] Thomas Scialom, et al. “Discriminative Adversarial Search for Abstractive Summarization” ICML 2020.
[28] Clara Meister, et al. “If beam search is the answer, what was the question?” EMNLP 2020.
[29] Xiang Lisa Li and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation.” arXiv preprint arXiv:2101.00190 (2021).
[30] Lianhui Qin, et al. “Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning.” arXiv preprint arXiv:2010.05906 (2020).
[31] Muhammad Khalifa, et al. “A Distributional Approach to Controlled Text Generation” Accepted by ICLR 2021.
[32] Aditya Grover, et al. “Bias correction of learned generative models using likelihood-free importance weighting.” NeurIPS 2019.
[33] Yuntian Deng et al. “Residual Energy-Based Models for Text Generation.” ICLR 2020.
[34] Brian Lester et al. “The Power of Scale for Parameter-Efficient Prompt Tuning.” arXiv preprint arXiv:2104.08691 (2021).
[35] Xiao Liu et al. “GPT Understands, Too.” arXiv preprint arXiv:2103.10385 (2021).
[36] Welleck & Kulikov et al. “Neural Text Generation with Unlikelihood Training” arXiv:1908.04319 (2019).