Abstract
We present DLSF (Dual-Layer Synergistic Fusion), a framework for high-fidelity image synthesis with diffusion-based generative models. DLSF introduces a dual-layer latent fusion strategy built on two modules, Adaptive Global Fusion (AGF) and Dynamic Spatial Fusion (DSF), which improve structural consistency and detail preservation relative to Stable Diffusion XL (SDXL). Extensive evaluations show better results on FID, sFID, IS, and diversity metrics.
Pipeline Overview
Figure 1: DLSF Pipeline

Method
Our approach builds upon the SDXL architecture with two key modules. AGF performs adaptive cross-level feature harmonization, while DSF applies spatially-aware refinement via attention maps. Together, they create a fused latent representation that enhances both global semantics and local texture fidelity.
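To make the two fusion styles concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes two latent tensors of identical shape (called `base` and `refined` here, both hypothetical names), a per-channel sigmoid gate computed from globally pooled statistics for the AGF-style path, and a per-pixel sigmoid attention map computed from channel-wise similarity for the DSF-style path. The gate weights, which would be learned in a real model, are random placeholders.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def agf_style_fusion(base, refined, weight, bias):
    """Global fusion sketch: one blending gate per channel.

    base, refined: latents of shape (C, H, W).
    weight, bias: hypothetical gate parameters of shape (C, 2C) and (C,).
    """
    # Summarize each latent by its per-channel spatial mean, then concatenate.
    pooled = np.concatenate([base.mean(axis=(1, 2)), refined.mean(axis=(1, 2))])
    gate = sigmoid(weight @ pooled + bias)[:, None, None]  # (C, 1, 1)
    # Convex combination: the same gate is applied at every spatial location.
    return gate * refined + (1.0 - gate) * base


def dsf_style_fusion(base, refined):
    """Spatial fusion sketch: one blending weight per pixel.

    The attention map here is an assumed form: a scaled dot product
    between the two latents across channels, squashed by a sigmoid.
    """
    channels = base.shape[0]
    score = (base * refined).sum(axis=0) / np.sqrt(channels)  # (H, W)
    attn = sigmoid(score)[None, :, :]  # (1, H, W)
    # Convex combination: the weight varies across spatial locations.
    return attn * refined + (1.0 - attn) * base


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
base = rng.standard_normal((C, H, W))
refined = rng.standard_normal((C, H, W))
weight = rng.standard_normal((C, 2 * C)) * 0.1  # placeholder for learned weights
bias = np.zeros(C)

fused_global = agf_style_fusion(base, refined, weight, bias)
fused_spatial = dsf_style_fusion(base, refined)
print(fused_global.shape, fused_spatial.shape)
```

In this sketch both paths are convex blends, so each fused value stays between the two input values at that position. The only difference is the granularity of the blending weight: per channel for the global path, per pixel for the spatial path.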
Quantitative Results
Table 1: Comparison at 256x256

Table 2: Comparison at 512x512

Qualitative Results
Figure 2: Comparison among SDXL, AGF, and DSF

Figure 3: AGF vs DSF samples

Ablation Study
Applying additional refinement after feature fusion degrades both FID and IS, which highlights the importance of carefully controlled latent integration.
Table 3: Ablation Study Results

Additional Samples
Figure 4: Rendering legible text remains challenging (e.g., street signs)


Figure 5: Uncurated 1024 × 1024 AGF samples. Class label: “chickadee” (19), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 6: Uncurated 1024 × 1024 DSF samples. Class label: “jay” (17), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 7: Uncurated 1024 × 1024 AGF samples. Class label: “terrapin” (36), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 8: Uncurated 1024 × 1024 DSF samples. Class label: “macaque” (373), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 9: Uncurated 1024 × 1024 AGF samples. Class label: “brown bear” (294), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 10: Uncurated 1024 × 1024 DSF samples. Class label: “bison” (347), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 11: Uncurated 1024 × 1024 DSF samples. Class label: “altar” (406), classifier-free guidance scale = 5.0, DDIM steps = 15.

Figure 12: Uncurated 1024 × 1024 DSF samples. Class label: “candle” (470), classifier-free guidance scale = 5.0, DDIM steps = 15.
BibTeX
@inproceedings{DLSF2025,
  author    = {Zhen-Qi Chen and Yuan-Fu Yang},
  title     = {DLSF: Dual-Layer Synergistic Fusion for High-Fidelity Image Synthesis},
  booktitle = {Proceedings of the International Conference on Machine Vision Applications (MVA)},
  year      = {2025},
  note      = {Oral Presentation}
}