FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

1AGI Lab, Westlake University     2Central South University
*This work was done during Guangzhao Li's visit to AGI Lab, Westlake University.
Indicates corresponding author

Featured Video Editing Results

Interactive before-and-after comparisons showcasing our editing capabilities.

Abstract

Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.

Method

FlowDirector Method Overview

FlowDirector is a novel, inversion-free video editing framework operating directly in data space. It comprises three key components: (1) Editing Flow Generation, which creates direct source-to-target paths using velocity differences from a pre-trained text-to-video model, avoiding noise inversion; (2) Spatially Attentive Flow Correction (SAFC), an attention-guided mask preserving non-target regions; and (3) Differential Averaging Guidance (DAG), which uses differential velocity signals for stronger semantic alignment and structural consistency. This integrated approach yields efficient, coherent video editing with superior instruction adherence, temporal consistency, and background preservation.
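The three components above can be illustrated with a minimal numerical sketch. This is not the paper's implementation: `velocity(x, t, prompt)` stands in for a pretrained text-to-video velocity model, and the mask, guidance scale, and candidate-perturbation scheme are all illustrative assumptions. It only shows the overall shape of the computation: a data-space Euler integration of an editing velocity built from target-minus-source velocity differences, averaged over candidate flows, and spatially masked.

```python
import numpy as np

def edit_step(x, t, velocity, src_prompt, tgt_prompt, mask,
              n_candidates=4, guidance=1.0, rng=None):
    """One Euler step of a sketched editing ODE.

    `velocity(x, t, prompt)` is a stand-in for a pretrained
    text-to-video velocity model; all names here are illustrative.
    """
    rng = rng or np.random.default_rng(0)

    # Editing flow generation: the difference between target- and
    # source-conditioned velocities gives a direct source-to-target
    # direction in data space (no noise inversion).
    v_edit = velocity(x, t, tgt_prompt) - velocity(x, t, src_prompt)

    # Differential Averaging Guidance (sketch): average differential
    # velocity signals over several perturbed candidate flows, then
    # blend with the base editing flow via a guidance weight.
    diffs = []
    for _ in range(n_candidates):
        x_pert = x + 0.01 * rng.standard_normal(x.shape)
        diffs.append(velocity(x_pert, t, tgt_prompt)
                     - velocity(x_pert, t, src_prompt))
    v_guided = v_edit + guidance * (np.mean(diffs, axis=0) - v_edit)

    # Spatially Attentive Flow Correction (sketch): zero the velocity
    # outside an attention-derived edit mask so non-target regions
    # remain untouched across all frames.
    return mask * v_guided

def run_editing_flow(x0, velocity, src_prompt, tgt_prompt, mask, steps=10):
    """Integrate the editing ODE from the source video x0 in data space."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * edit_step(x, k * dt, velocity,
                               src_prompt, tgt_prompt, mask)
    return x
```

In this toy setting the masked region flows smoothly toward the target while the unmasked background is exactly preserved, mirroring the background-preservation behavior the method aims for.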

Video Editing Results


BibTeX

@article{li2025flowdirector0,
  title   = {FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing},
  author  = {Guangzhao Li and Yanming Yang and Chenxi Song and Chi Zhang},
  year    = {2025},
  journal = {arXiv preprint arXiv:2506.05046}
}