ETIA: Enhancing Text2Image Surround View Scene Generation With Semantic Annotation via Diffusion for Autonomous Driving

Generating high-fidelity surround view images from text prompts is a complex task that requires balancing contextual coherence with computational efficiency. The proposed work introduces a novel methodology that combines recurrent attention-based encoder-decoder architectures with text-to-image diffusion models to produce coherent and continuous surround view images. The approach uses a custom text encoder to convert input text prompts into contextual embeddings, which are then processed by the proposed ViewNet Unet2d architecture within the decoder. This architecture employs dual cross-attention mechanisms: one aligns text embeddings with the corresponding noise image latents, while the other integrates previously generated image latents to ensure continuity across the sequence. This design guarantees that each generated image adheres to its specific prompt while maintaining coherence with preceding images. In addition, an annotation decoder is introduced that generates semantic segmentation maps, instance segmentation masks, and object detection annotations; it processes latent image maps using a shared feature extraction backbone and dedicated heads for each annotation task. Experimental results on the nuScenes validation set demonstrate the effectiveness of the proposed model in producing high-quality, contextually aligned surround view images. The model achieves an FVD of 99 and an FID of 12.6, outperforming existing methods such as Panacea+ and DriveDreamer-2. Furthermore, the approach improves segmentation and detection accuracy, achieving a PQ of 67.4, an mIoU of 80.1, and an mAP of 65.4, surpassing methods such as OpenSeeD and D2Det. An ablation study highlights the contributions of key components of the architecture: integrating positional encoding, self-attention, and concurrent attention significantly enhances generation quality, reducing FVD to 99 and FID to 12.6. Overall, the proposed work produces high-quality, contextually aligned surround view images with comprehensive annotations, pushing the boundaries of text-to-image synthesis and scene understanding.
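The dual cross-attention design described in the abstract can be pictured with a short sketch. The paper's ViewNet Unet2d implementation is not included in this record, so everything below (class name, dimensions, layer layout) is a hypothetical PyTorch illustration of the stated idea: one cross-attention aligns the current view's noise latents with the prompt embeddings, and a second attends over the previously generated view's latents to keep the surround sequence continuous.

```python
import torch
import torch.nn as nn

class DualCrossAttentionBlock(nn.Module):
    """Hypothetical decoder block: text-conditioned cross-attention,
    then cross-attention over the previously generated view's latents."""
    def __init__(self, latent_dim: int = 320, text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(latent_dim)
        # Queries come from noise latents; keys/values from text embeddings.
        self.text_attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm2 = nn.LayerNorm(latent_dim)
        # Keys/values from the previous view's latents for cross-view continuity.
        self.prev_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, noise_latents, text_emb, prev_latents):
        # Align the current view's noise latents with its prompt.
        x = noise_latents + self.text_attn(
            self.norm1(noise_latents), text_emb, text_emb)[0]
        # Attend to the preceding view's latents so the sequence stays coherent.
        x = x + self.prev_attn(self.norm2(x), prev_latents, prev_latents)[0]
        return x

# Example: batch of 2 views, 64 flattened latent tokens, 77 prompt tokens.
blk = DualCrossAttentionBlock()
out = blk(torch.randn(2, 64, 320), torch.randn(2, 77, 768), torch.randn(2, 64, 320))
print(out.shape)  # torch.Size([2, 64, 320])
```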
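The annotation decoder is described only at the level of "shared feature extraction backbone plus dedicated heads." The following is a minimal sketch of that layout, assuming 4-channel latent maps (typical for Stable-Diffusion-style latents) and illustrative head designs; class name, channel counts, and head output formats are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnnotationDecoder(nn.Module):
    """Hypothetical shared-backbone decoder with one head per annotation task."""
    def __init__(self, latent_ch: int = 4, feat_ch: int = 256,
                 num_classes: int = 23, num_anchors: int = 9):
        super().__init__()
        # Shared feature extraction backbone over the latent image maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(latent_ch, feat_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.SiLU())
        # Dedicated heads for the three annotation tasks named in the abstract.
        self.semantic_head = nn.Conv2d(feat_ch, num_classes, 1)   # per-pixel class logits
        self.instance_head = nn.Conv2d(feat_ch, 2, 1)             # e.g. center/offset mask cues
        self.detect_head = nn.Conv2d(feat_ch, num_anchors * 5, 1) # box deltas + objectness

    def forward(self, latents):
        feats = self.backbone(latents)
        return self.semantic_head(feats), self.instance_head(feats), self.detect_head(feats)

# Example: one 32x32 latent map in, three annotation maps out.
sem, inst, det = AnnotationDecoder()(torch.randn(1, 4, 32, 32))
```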
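For context on the reported numbers: FID is the Fréchet distance between Gaussian fits of Inception features extracted from real and generated images, and FVD applies the same distance to I3D video features, so lower is better for both. The standard definition, with $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ the feature means and covariances of the real and generated distributions, is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
             + \operatorname{Tr}\!\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr)
```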


Bibliographic Details
Main Authors: Ramyashree, S. Raghavendra, S. K. Abhilash, Venu Madhav Nookala, P. V. Arun Kumar, P. Malashree
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Annotation decoder; diffusion models; self-attention; segmentation; ViewNet Unet2d
Online Access: https://ieeexplore.ieee.org/document/11087593/
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3591146
Citation: IEEE Access, vol. 13, pp. 132209-132222, 2025 (IEEE document 11087593)
Author Affiliations:
Ramyashree (ORCID 0000-0002-0237-2444): Manipal Academy of Higher Education, Manipal Institute of Technology, Manipal, Karnataka, India
S. Raghavendra (ORCID 0000-0003-2733-3916): Manipal Academy of Higher Education, Manipal Institute of Technology, Manipal, Karnataka, India
S. K. Abhilash (ORCID 0000-0002-1119-4782): KPIT Technologies, Bengaluru, India
Venu Madhav Nookala (ORCID 0000-0002-0078-5050): KPIT Technologies, Bengaluru, India
P. V. Arun Kumar: KPIT Technologies, Bengaluru, India
P. Malashree: Government P.U College for Girls, Udupi, Karnataka, India