Self-Refine: Iterative Refinement with Self-Feedback

1Language Technologies Institute, Carnegie Mellon University
2Allen Institute for AI, 3University of Washington, 4NVIDIA, 5UC San Diego, 6Google Research, Brain Team

Self-Refine is a novel approach that allows LLMs to iteratively refine outputs and incorporate feedback along multiple dimensions to improve performance on diverse tasks. Unlike prior work, it does not require supervised training data or reinforcement learning, and works with a single LLM.

Self-Refine iteratively improves outputs from LLMs through a process of repeated generation, feedback, and refinement.


Framework Description


Given an input x and an initial output y0, Self-Refine successively refines the output in a FEEDBACK → REFINE → FEEDBACK loop. The initial output y0 is assumed to be produced by a generator model, which can be a specialized fine-tuned model or a few-shot prompted model.
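
Schematically, one pass through the loop can be summarized as follows (this notation simply restates the description above; both modules are calls to the same underlying LLM with different task-specific prompts):

    fbt  = FEEDBACK(x, yt)
    yt+1 = REFINE(x, yt, fbt)

The loop repeats until a stopping condition is met, for example a fixed iteration budget or feedback indicating that no further changes are needed.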


Components

  • FEEDBACK: receives the initial output y0 and provides feedback on how to enhance it. This feedback is task-dependent and generally addresses multiple aspects of the input. In the above example, the feedback might concern the sentiment level and vividness of the review. Importantly, the feedback is actionable, as the feedback module identifies specific areas that can be refined (e.g., “the sentiment level is neutral” or “the sentiment is neutral due to phrases like good”).

  • REFINE is responsible for refining yt based on the feedback received from the FEEDBACK module. In the example, informed by the feedback that the review's sentiment is neutral due to phrases like “good”, the model may attempt to increase the positivity by substituting “good” with “amazing” (a sketch of both prompts follows this list).
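
To make this concrete, the sketch below shows what task-specific prompts for the review-rewriting example might look like. Both feedback_prompt and refiner_prompt are illustrative few-shot prompt strings, not the exact prompts used in the paper:

# Illustrative few-shot prompts (hypothetical; the actual prompts are
# task-specific and typically contain several demonstrations).
feedback_prompt = """Given a review and a target sentiment, point out the specific
phrases that keep the review from reaching the target sentiment and say how to fix them.

Review: The food was good.
Target sentiment: very positive
Feedback: The sentiment is neutral due to phrases like "good"; replace them with
stronger positive words such as "amazing" and add vivid detail."""

refiner_prompt = """Rewrite the review so that it addresses the feedback while
preserving the original meaning.

Review: The food was good.
Feedback: The sentiment is neutral due to phrases like "good"; replace them with
stronger positive words such as "amazing" and add vivid detail.
Rewritten review: The food was absolutely amazing, and every dish was bursting with flavor!"""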



Key Features

  • The FEEDBACK → REFINE → FEEDBACK loop can be applied multiple times.

  • Self-Refine retains the history of past iterations. This is achieved by continuously appending the previous outputs to the prompt, which allows the system to learn from past mistakes and avoid repeating them (see the sketch after this list).

  • Self-Refine generates actionable feedback. Given the initial output from an LLM, the FEEDBACK module pinpoints the reasons why the output does or does not meet the requirements. The actionable feedback covers two aspects: (i) localization of the problem and (ii) an instruction for how to improve it.
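
One simple way to implement the history retention mentioned above, assuming each module is a plain text-completion call, is to concatenate the previous outputs and their feedback into the refiner's prompt. The helper build_refiner_input below is a hypothetical illustration, not part of the released code:

def build_refiner_input(refiner_prompt: str, x: str, history: list) -> str:
    """Append every previous (output, feedback) pair to the prompt so the model
    can see its past attempts and avoid repeating the same mistakes."""
    parts = [refiner_prompt, f"Input: {x}"]
    for output, feedback in history:
        parts.append(f"Previous output: {output}")
        parts.append(f"Feedback: {feedback}")
    parts.append("Improved output:")
    return "\n".join(parts)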



Adding a new task

Adding a new task to Self-Refine is straightforward. The following code snippet shows the basic structure of the framework. The FEEDBACK and REFINE modules are task-dependent and can be implemented in a variety of ways, for example as few-shot prompts over the same LLM. The is_refinement_sufficient function defines the stopping criterion for the iterative process and is likewise task-dependent.

def self_refine(prompt: str) -> str:
    def is_refinement_sufficient(prompt, feedback, initial, refined) -> bool:
        # Define the task-specific stopping criterion here (see the example below).
        pass

    # Initial generation; ChatGPT() stands for a call to the underlying LLM.
    answer = ChatGPT(prompt)

    while True:
        # FEEDBACK: critique the current answer with a task-specific feedback prompt.
        feedback = ChatGPT(feedback_prompt, answer)
        # REFINE: improve the answer based on the feedback.
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined
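
Note that the is_refinement_sufficient stub above never returns True, so the loop as written would not terminate; the stopping criterion has to be filled in per task. Below is a minimal sketch, assuming the stop is triggered either by a stop phrase in the feedback or by the output no longer changing (both choices are illustrative, not the paper's exact criteria):

def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
    # Stop if the feedback signals that no further changes are needed
    # ("no further improvement" is a hypothetical stop phrase).
    if "no further improvement" in feedback.lower():
        return True
    # Stop if the refinement no longer changes the output.
    return refined.strip() == initial.strip()

In practice, a hard cap on the number of iterations is a useful complementary safeguard against the loop running indefinitely.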
                        
Overview of Self-Refine: given an initial output (left), FEEDBACK evaluates it and generates the actionable feedback required to correct it (center). The REFINE module takes the feedback into account and refines the output (right). For example, in the top row, an initial review with negative sentiment is first transformed into a positive one, then further refined through feedback. In the bottom row, an initial code snippet is provided, followed by feedback identifying a more efficient approach, and finally an optimized implementation after applying the suggested improvements.

Results summary

Tasks and Setup

We conduct extensive experiments on 7 diverse tasks (review rewriting, acronym generation, story generation, code rewriting, response generation, constrained generation, and toxicity removal), demonstrating that Self-Refine outperforms direct generation from strong generators such as GPT-3.5 and even GPT-4, with improvements ranging from 5% to more than 40%.




The diverse set of tasks used to evaluate Self-Refine, along with their associated datasets and sizes. On the right, we show an example of a single iteration of refinement: the input x from the dataset, the previously generated output yt, the generated feedback fb, and the refined output yt+1 produced using the feedback.

Main results

We show the relative improvements in performance across diverse tasks using the Self-Refine framework. The improvements are measured as the percentage increase in preference rate from human evaluation (“Human Eval.”) or in task-specific performance metrics, demonstrating the effectiveness of the iterative refinement process in generating higher-quality outputs.
Self-Refine outperforms direct generation from the base models on all tasks.

Impact of Iterative Refinement. Self-Refine aims to improve the output iteratively through successive cycles of feedback and refinement. Does the output improve at each step? We analyze this question on three datasets: Sentiment Reversal, Math Reasoning, and Code Optimization.

Self-Refine iteratively improves outputs. The rate of refinement represents the average improvement in percentage points per iteration, calculated from Iteration 0 to Iteration 2.

Self-Refine Prompt Examples

Examples of the initial generation, feedback, and refined output are shown below for all tasks. Clicking on any example shows the input, the generated feedback, and the new output.

Acronym Generation



Dialogue Response Generation



Commonsense Generation



GSM Generation



Code Optimization



Code Readability



Sentiment

Concurrent Research and Developments

Iterative refinement with self-feedback is a promising approach for improving large language models (LLMs). Many recent and concurrent works have explored this idea from different perspectives and applications; please check out the Twitter list below. These works are all important and relevant to our research, as they show the potential and challenges of using LLMs as both generators and evaluators of their own outputs.

Self-Refine in Social Media and News

BibTeX


@misc{madaan2023selfrefine,
    title={Self-Refine: Iterative Refinement with Self-Feedback}, 
    author={Aman Madaan and Niket Tandon and Prakhar Gupta and Skyler Hallinan and Luyu Gao and Sarah Wiegreffe and Uri Alon and Nouha Dziri and Shrimai Prabhumoye and Yiming Yang and Sean Welleck and Bodhisattwa Prasad Majumder and Shashank Gupta and Amir Yazdanbakhsh and Peter Clark},
    year={2023},
    eprint={2303.17651},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}