
IKDDiT

Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer

Yuan-Fu Yang¹ · Hsiu-Hui Hsiao²
¹National Yang Ming Chiao Tung University · ²National Taiwan University of Science and Technology
Diffusion Transformer · Photolithography Overlay Map Generation · Semiconductor Manufacturing

Introduction

IKDDiT explores how diffusion models and knowledge distillation can improve overlay map generation in semiconductor photolithography. This page provides an accessible overview of the motivation, the core design, and representative results, with figures consolidated from the paper and its supplementary material.

High-level overview of IKDDiT pipeline

Core Idea

Rather than relying on a heavy standalone model, IKDDiT distills knowledge from a teacher network to inject semiconductor-specific priors into a compact diffusion generator. The result is a model that is both efficient and accurate for overlay map synthesis.


In short: a compact diffusion model enhanced by knowledge transfer for manufacturing data.
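The sketch below illustrates, in rough PyTorch-style pseudocode, how a distillation term can sit alongside the standard denoising objective. The module interfaces (student, teacher, schedule), the feature-matching loss, and the 0.5 weight are assumptions made for illustration only; they are not the actual IKDDiT objective.

import torch
import torch.nn.functional as F

def distillation_training_step(student, teacher, x0, cond, schedule):
    # Sample a diffusion timestep and noise the clean overlay map x0.
    t = torch.randint(0, schedule.num_steps, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    xt = schedule.add_noise(x0, noise, t)

    # The student predicts the noise; the frozen teacher provides reference features.
    eps_pred, feat_student = student(xt, t, cond, return_features=True)
    with torch.no_grad():
        _, feat_teacher = teacher(xt, t, cond, return_features=True)

    denoise_loss = F.mse_loss(eps_pred, noise)               # standard diffusion loss
    distill_loss = F.mse_loss(feat_student, feat_teacher)    # transfer of teacher priors
    return denoise_loss + 0.5 * distill_loss                 # weight chosen arbitrarily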


Architecture

Architecture of IKDDiT. It uses a pre-trained text encoder ε_φt and an image encoder ε_φi, developed through unified contrastive learning, to generate conditional tokens. These tokens are then processed by the teacher and student DiT encoders to perform a self-supervised discriminative process with D_φ in the joint embedding space.

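The caption describes two stages, which the minimal sketch below mirrors under assumed interfaces: the pre-trained encoders ε_φt and ε_φi produce conditional tokens, and D_φ scores teacher-student agreement in the joint embedding space. Names such as text_encoder, teacher_dit, and discriminator are placeholders rather than the released implementation.

import torch
import torch.nn.functional as F

def build_conditional_tokens(text_encoder, image_encoder, text, image):
    # Frozen encoders from unified contrastive learning (ε_φt, ε_φi) map the
    # prompt and the reference image into a shared token space.
    with torch.no_grad():
        text_tokens = text_encoder(text)      # (B, N_text, D)
        image_tokens = image_encoder(image)   # (B, N_image, D)
    return torch.cat([text_tokens, image_tokens], dim=1)

def discriminative_alignment_loss(teacher_dit, student_dit, discriminator, xt, t, cond):
    # Teacher and student DiT encoders embed the same noisy input; D_φ judges
    # whether the student's embedding agrees with the teacher's.
    with torch.no_grad():
        z_teacher = teacher_dit(xt, t, cond)
    z_student = student_dit(xt, t, cond)
    logits = discriminator(z_student, z_teacher)
    target = torch.ones_like(logits)          # push the embeddings to agree
    return F.binary_cross_entropy_with_logits(logits, target)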

Training Efficiency

To evaluate the convergence behavior of our IKDDiT model, we compare FID scores across training stages against state-of-the-art baselines. All models, in the XL configuration, are trained with a batch size of 64 for up to 578.1k iterations. As shown in Figure 5, IKDDiT exhibits consistently faster convergence. At 250k iterations, IKDDiT reaches an FID of 11.6, already surpassing DiT, MDT, and MaskDiT, which only achieve FID scores of 14.1, 12.2, and 11.9, respectively, after 500k iterations. Furthermore, IKDDiT attains an FID of 6.8 at 500k iterations, outperforming all competing methods. These results demonstrate that IKDDiT converges nearly twice as fast, underscoring the effectiveness of incorporating self-supervised discrimination into DiT training.
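For convenience, the FID numbers quoted above are collected here (entries not stated in the text are marked n/a):

Model     FID @ 250k iterations   FID @ 500k iterations
DiT       n/a                     14.1
MDT       n/a                     12.2
MaskDiT   n/a                     11.9
IKDDiT    11.6                    6.8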

Training convergence (FID vs. iterations).

Results

Figure 4: Scalability and Model Configurations.
Figure 5: Model Scaling on Training Loss.

Visualization Results

Qualitative Comparison on Overlay Map Generation.
Representative Results by Our Proposed Model.

Resources

BibTeX

@inproceedings{IKDDiT2025,
  author    = {Yuan-Fu Yang and Hsiu-Hui Hsiao},
  title     = {Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
  note      = {To appear}
}

The bibliographic entry will be updated with the official venue and page information once the proceedings are finalized.