DLSF: Dual-Layer Synergistic Fusion for High-Fidelity Image Synthesis

Abstract

We present DLSF, a framework for high-fidelity image synthesis with diffusion-based generative models. DLSF introduces a dual-layer latent fusion strategy built on two modules, Adaptive Global Fusion (AGF) and Dynamic Spatial Fusion (DSF), which improves structural consistency and detail preservation over Stable Diffusion XL (SDXL). In our evaluations, DLSF outperforms the SDXL baseline on Fréchet Inception Distance (FID), spatial FID (sFID), Inception Score (IS), and diversity metrics.

Pipeline Overview

Figure 1: DLSF Pipeline

Method

Our approach adds two modules to the SDXL architecture. AGF performs adaptive cross-level feature harmonization, while DSF applies spatially aware refinement guided by attention maps. Together, they produce a fused latent representation that improves both global semantics and local texture fidelity.
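
The difference between a global and a spatial fusion rule can be illustrated with a small NumPy sketch. This is not the DLSF implementation: the page does not specify the gating functions, so the function names, the two input latents z_a and z_b, and the weights w, b, w_a, w_b are all hypothetical. The first function uses one gate per channel computed from spatially pooled features (a global rule); the second uses a per-pixel attention map over the two latents (a spatial rule).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_fusion(z_a, z_b, w, b):
    """Hypothetical AGF-style rule: one mixing gate per channel.

    z_a, z_b: latents of shape (C, H, W); w: (C, 2C); b: (C,).
    """
    # Pool each latent over space to get a global descriptor of length 2C.
    pooled = np.concatenate([z_a.mean(axis=(1, 2)), z_b.mean(axis=(1, 2))])
    gate = sigmoid(w @ pooled + b)[:, None, None]  # (C, 1, 1), values in (0, 1)
    return gate * z_a + (1.0 - gate) * z_b

def spatial_fusion(z_a, z_b, w_a, w_b):
    """Hypothetical DSF-style rule: one attention weight per spatial location.

    w_a, w_b: (C,) projections that score each location of each latent.
    """
    score_a = np.tensordot(w_a, z_a, axes=1)  # (H, W)
    score_b = np.tensordot(w_b, z_b, axes=1)  # (H, W)
    # Numerically stable two-way softmax over the two sources.
    m = np.maximum(score_a, score_b)
    e_a, e_b = np.exp(score_a - m), np.exp(score_b - m)
    attn_a = e_a / (e_a + e_b)
    return attn_a[None] * z_a + (1.0 - attn_a)[None] * z_b
```

Both rules return a convex combination of the two inputs, so the fused latent stays within the elementwise range of z_a and z_b. How the actual modules are parameterized and combined is not described on this page.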

Quantitative Results

Table 1: Comparison at 256x256

Table 2: Comparison at 512x512
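
For readers unfamiliar with the FID metric reported in these comparisons, the following is a minimal sketch of the standard Fréchet distance between two Gaussians fitted to feature vectors. It assumes the features have already been extracted (FID conventionally uses Inception-v3 activations); the function names are illustrative and this is not the evaluation code behind the reported numbers.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # For PSD covariances, the eigenvalues of sigma1 @ sigma2 are real and
    # non-negative, and Tr((sigma1 sigma2)^(1/2)) is the sum of their roots.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def fid_from_features(feats_real, feats_gen):
    """feats_*: arrays of shape (num_samples, feature_dim)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Lower values indicate that the generated feature distribution is closer to the real one; identical feature sets give a distance of zero up to floating-point error.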

Qualitative Results

Figure 2: Comparison among SDXL, AGF, and DSF

Figure 3: AGF vs DSF samples

Ablation Study

Applying an additional refinement stage after feature fusion degrades both FID and IS (Table 3), which suggests that latent integration must be carefully controlled rather than followed by further processing.

Table 3: Ablation Study Results

Additional Samples

Figure 4: Text generation remains challenging (e.g. street signs)


Figure 5. Uncurated 1024 × 1024 AGF samples. Class label: “chickadee” (19), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 6. Uncurated 1024 × 1024 DSF samples. Class label: “jay” (17), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 7. Uncurated 1024 × 1024 AGF samples. Class label: “terrapin” (36), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 8. Uncurated 1024 × 1024 DSF samples. Class label: “macaque” (373), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 9. Uncurated 1024 × 1024 AGF samples. Class label: “brown bear” (294), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 10. Uncurated 1024 × 1024 DSF samples. Class label: “bison” (347), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 11. Uncurated 1024 × 1024 DSF samples. Class label: “altar” (406), Classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 12. Uncurated 1024 × 1024 DSF samples. Class label: “candle” (470), Classifier-free guidance scale = 5.0, DDIM steps = 15.
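
The sample captions above report a classifier-free guidance scale of 5.0 and 15 DDIM steps. The sketch below shows how these two settings typically enter a sampler. It rests on assumptions not stated on this page: 1000 training timesteps, a uniform timestep stride, deterministic DDIM (eta = 0), and the guidance convention in which a scale of 1.0 recovers the conditional prediction. All function names are illustrative.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=5.0):
    """Push the noise prediction away from the unconditional estimate."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def ddim_timesteps(num_steps=15, num_train_steps=1000):
    """Uniformly strided subset of training timesteps, in descending order."""
    stride = num_train_steps // num_steps  # assumed uniform spacing
    return (np.arange(num_steps) * stride)[::-1]

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    alpha_bar_* are cumulative products of (1 - beta) at the two timesteps.
    """
    # Predict the clean sample implied by the current noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise the prediction to the previous (less noisy) timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps
```

Under these assumptions, each of the 15 steps calls the model twice (conditional and unconditional), combines the two predictions with the guidance rule, and applies one DDIM update.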

BibTeX

          @inproceedings{DLSF2025,
            author    = {Zhen-Qi Chen and Yuan-Fu Yang},
            title     = {DLSF: Dual-Layer Synergistic Fusion for High-Fidelity Image Synthesis},
            booktitle = {Proceedings of the International Conference on Machine Vision Applications (MVA)},
            year      = {2025},
            note      = {Oral Presentation}
          }