Guiding 3D Digital Content Generation with Pre-Trained Diffusion Models



I. INTRODUCTION
Humans describe the world through text, comprehend it through images, and experience and interact with it in a three-dimensional (3D) format. Therefore, generative models have found widespread application in numerous aspects of life, playing a significant role in advancing human society. Research in recent years has mainly focused on text generation [1], [2], [3], [4] and image generation [5], [6], [7], [8]. Text generation is typically used for language tasks such as translation and question answering, while image generation often involves creating visuals based on textual prompts. The generation of 3D digital content has not yet achieved the extraordinary capabilities seen in the domains of text and image generation, so continued research on 3D digital content generation is still needed. 3D digital content is extensively utilized in fields such as film, architecture, and virtual and augmented reality. However, the current mainstream production of 3D digital content relies heavily on 3D designers, leading to remarkably low production efficiency and high entry barriers. Consequently, employing artificial intelligence (AI) to generate 3D digital content can significantly enhance production efficiency, reduce industry barriers, and foster the development of related fields.
Zero-shot image models [9] are trained on hundreds of millions of image-text pairs, a data scale that is difficult to achieve in the 3D domain. Table I presents a comparison between the data volumes of mainstream 3D and 2D datasets. Conventional 3D digital content generation methodologies predominantly utilize 3D datasets for training specific generative models [10], [11]. The advantage of this method lies in its ability to generate 3D objects with consistent geometry. However, it is limited by the current lack of sufficiently large 3D datasets and the absence of efficient 3D digital content generation architectures, as well as the computational power needed for their training. Therefore, it is difficult for this approach to achieve a breakthrough in the short term. In light of this, this paper focuses on using pre-trained diffusion models [7], [12], [13] to supervise the generation of 3D digital content. Diffusion models, trained on billions of image-text pairs, have propelled the latest advancements in text-to-image generation, demonstrating the capability to produce high-fidelity images under textual prompts [24], [25], [26], [27], [28]. Utilizing pre-trained diffusion models for generating 3D digital content [29], [30] significantly reduces computational power requirements and dependence on 3D datasets, thereby greatly enhancing the feasibility and efficiency of 3D digital content generation. This paper meticulously investigates and analyzes methods for generating 3D digital content, focusing on two key aspects: diffusion model priors and 3D representations. The generation of 3D digital content is categorized into two types based on the task: text-to-3D [29], [30], [31], [32], [33], [34], [35], [36], [37] and image-to-3D [35], [38], [39], [40], [41]. To compare the strengths and limitations of each approach, this study conducts a horizontal comparison of different models in terms of efficiency and quality. This paper also explores the challenges associated with generating 3D digital content using pre-trained diffusion models and discusses potential solutions to these issues.
Our contributions are summarized as follows:
• This paper delivers an exhaustive review and investigation of methods for generating 3D digital content, with a foundation in diffusion models.
• A horizontal comparison and analysis are conducted in this paper to discern variations in efficiency and quality among different models.
• Several viable solutions are proposed in this paper to address the current challenges in generating 3D digital content using diffusion models.
• Potential future research directions in the field of 3D digital content generation, guided by diffusion models, are outlined in this paper.
Additionally, it is worth noting that there is currently a lack of universally recognized evaluation metrics for text-to-3D digital content generation. We currently assess quality solely through visual observation, which introduces a certain level of subjectivity. In the realm of image-to-3D digital content generation, we employ image-based metrics to objectively evaluate the generated 3D digital content. Furthermore, due to limitations in laboratory conditions, all experiments in this paper were conducted using a single A40 GPU, and the results are presented accordingly. This paper is organized as follows: Section II introduces the relevant background knowledge on 3D representation methods and diffusion models. Section III conducts a comprehensive analysis and study of the schemes, algorithms, and workflows for both text-to-3D and image-to-3D conversions. Section IV provides a holistic evaluation of existing 3D content generation approaches, analyzing the strengths and limitations of different methodologies. Section V explores the current challenges and proposes envisioned solutions. Finally, the paper concludes with a summary and presents our thoughts on future research directions and themes in this field.

II. RELATED WORK
The generation of 3D digital content based on diffusion models principally involves two components: 3D representation and diffusion priors. DreamFusion [29] pioneered the integration of diffusion models into the task of 3D digital content generation. Subsequent studies in this domain have been categorized into two approaches based on their characteristics: optimization-based methods [42] and multi-view prediction-based methods [43], [44]. The focal point of research in this field has been centered on optimizing 3D representations or fine-tuning diffusion models.
A. 3D Representation
1) Neural radiance fields: NeRF uses a neural network to learn the continuous volume density and color of a scene [53]. Central to NeRF is the use of a Multi-Layer Perceptron (MLP) to parametrically represent 3D objects, enabling high-quality synthesis of novel-view images. Theoretically, it can model shapes at any spatial resolution [54]. The MLP, with parameters denoted as θ, is queried at points sampled along camera rays cast from a camera pose c, and it outputs a color and a density for each point. The colors and densities of the sampled points along each ray are accumulated through volume rendering to synthesize the color of each pixel, yielding the rendered image g(θ, c). NeRF can learn from a series of 2D images taken from different angles and synthesize highly realistic novel-view images, which is crucial for achieving realistic 3D scene reconstruction.
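To make the volume rendering step concrete, the sketch below accumulates sampled colors and densities along one camera ray into a pixel color. It is a minimal PyTorch-style illustration under simplifying assumptions (a placeholder `mlp` that returns per-point color and density, and a fixed near/far range), not the implementation of any particular NeRF codebase.

```python
import torch

def render_ray(mlp, ray_origin, ray_dir, near=2.0, far=6.0, n_samples=64):
    """Volume rendering along a single ray (minimal illustration).

    `mlp` is assumed to map 3D sample points to per-point (rgb, sigma).
    """
    # Sample points along the ray between the near and far planes.
    t_vals = torch.linspace(near, far, n_samples)              # (N,)
    points = ray_origin + t_vals[:, None] * ray_dir            # (N, 3)

    # Query the network for color and volume density at each sample.
    rgb, sigma = mlp(points)                                    # (N, 3), (N,)

    # Distances between adjacent samples; the last interval is set very large.
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])

    # Alpha compositing: per-segment opacity and transmittance along the ray.
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # (N,)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                     # (N,)

    # Accumulate the weighted colors into the final pixel color.
    return (weights[:, None] * rgb).sum(dim=0)                  # (3,)
```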
2) 3D Gaussian splatting: Structure-from-Motion (SfM) [55] can estimate point cloud distributions from a set of images using the COLMAP library. The work of 3D Gaussian Splatting starts with sparse SfM points, modeling the geometry as a set of 3D Gaussian functions. The fundamental idea of 3D Gaussian Splatting is to consider each point as the center of a Gaussian distribution. These points, rather than being isolated discrete entities, have a smooth, continuous weight distribution around them. Each point influences its surrounding area, quantified by a Gaussian function. Each 3D Gaussian is defined by the point's position, a covariance matrix, and an opacity α. Specifically, the point's position is the mean of the 3D Gaussian, the covariance matrix determines the shape of the 3D Gaussian, and the opacity α is used for splatting, with spherical harmonics (SH) [56], [57] representing color. The method uses adaptive Gaussian densification to control the number and density of Gaussians per unit volume. This approach overcomes the issues of slow rendering speed or compromised image quality in previous methods, enabling high-quality, real-time novel view synthesis at 1080p resolution.
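As an illustration of this parameterization, the sketch below stores one Gaussian primitive and assembles its covariance from a rotation and per-axis scales, the factorization used in 3D Gaussian Splatting to keep the covariance valid; the field names are our own, and projection, tile sorting, and adaptive densification are omitted.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    """One 3D Gaussian primitive (illustrative field names)."""
    mean: np.ndarray       # (3,) point position = mean of the Gaussian
    rotation: np.ndarray   # (3, 3) rotation matrix (derived from a unit quaternion)
    scale: np.ndarray      # (3,) per-axis standard deviations
    opacity: float         # alpha used during splatting
    sh_coeffs: np.ndarray  # spherical-harmonics coefficients encoding view-dependent color

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, which keeps the covariance symmetric positive semi-definite."""
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

    def weight(self, x: np.ndarray) -> float:
        """Unnormalized Gaussian influence of this primitive at point x, scaled by opacity."""
        d = x - self.mean
        return self.opacity * float(np.exp(-0.5 * d @ np.linalg.inv(self.covariance()) @ d))
```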

B. Diffusion Models
Diffusion models consist of a forward process $q_t, t \in [0,1]$, and a reverse process $p_t, t \in [0,1]$. The forward process resembles a straightforward Brownian motion with time-varying coefficients [58]. Specifically, this process incrementally adds noise $\epsilon \sim \mathcal{N}(0, I)$ to the original data $x_0$, thereby gradually transitioning the data distribution towards a Gaussian noise distribution [12], [59]. This step-by-step addition of noise effectively transforms the original data into a state that aligns with a predefined Gaussian distribution, laying the groundwork for the subsequent reverse process. Conversely, the reverse process employs a neural network to estimate the noise added at each step of the forward process, progressively denoising the Gaussian noise to ultimately restore the original data distribution. The distribution in the forward process is given by $q_t(x_t \mid x_0) := \mathcal{N}(\alpha_t x_0, \sigma_t^2 I)$ and $q_t(x_t) := \int q_t(x_t \mid x_0)\, q_0(x_0)\, \mathrm{d}x_0$. The coefficients $\alpha_t$ and $\sigma_t$ are selected to regulate the proportion of original data and noise. At the onset of the forward process, $\sigma_0 \approx 0$, while at the end, $\sigma_1 \approx 1$, with $\alpha_t^2 = 1 - \sigma_t^2$ [60], [61]. This careful adjustment of coefficients ensures a gradual and controlled transformation of the data. The reverse process, through a noise prediction network $\epsilon_\phi(x_t, t)$, predicts the noise added at each forward step. The overall training is conducted by minimizing
$$\mathcal{L}_{\mathrm{Diff}}(\phi) = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}\big[\, \omega(t)\, \| \epsilon_\phi(\alpha_t x_0 + \sigma_t \epsilon,\, t) - \epsilon \|_2^2 \,\big],$$
where $\omega(t)$ is a weighting function that depends on the timestep $t$. The noise prediction network can be used to approximate the score function of both $q_t$ and $p_t$ by
$$\nabla_{x_t} \log q_t(x_t) \approx -\,\epsilon_\phi(x_t, t)/\sigma_t.$$
Incorporating textual control within diffusion models enhances the controllability of the generated content [8]. Since each image adheres to a specific distribution pattern, utilizing the information embedded within the text as a directive allows for the progressive denoising of Gaussian noise images, culminating in the generation of images that align with the textual information. This process specifically involves training an encoder and a decoder, where the encoder maps images to a latent space and the decoder reconstructs images from this latent space. The textual prompts $y$ are encoded using a text encoder $\tau_\theta(y)$ and are integrated into each step of the denoising process, which is trained by minimizing
$$\mathcal{L}_{\mathrm{LDM}}(\phi) = \mathbb{E}_{z_0,\, y,\, t,\, \epsilon \sim \mathcal{N}(0, I)}\big[\, \| \epsilon_\phi(z_t,\, t,\, \tau_\theta(y)) - \epsilon \|_2^2 \,\big],$$
where $z_t$ denotes the noised latent at timestep $t$. By introducing conditions into the noise reconstruction process, controlled image generation is achieved. This methodology exhibits robustness in producing high-resolution images with intricate details while maintaining the semantic structure of the images [62].
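For concreteness, the sketch below implements one step of this noise-prediction objective in PyTorch-style Python: a clean sample is noised according to the schedule above, the network predicts the added noise, and the ω(t)-weighted squared error is returned. The `unet` interface and the schedule tensors are placeholders, not the API of any specific diffusion library.

```python
import torch

def diffusion_loss(unet, x0, text_emb, alphas, sigmas, weights):
    """One training step of the (conditional) noise-prediction objective.

    `unet` predicts the added noise from (x_t, t, text_emb); `alphas`, `sigmas`,
    and `weights` are 1-D tensors indexed by a discrete timestep, with
    alpha_t^2 = 1 - sigma_t^2 as in the schedule described above.
    """
    batch = x0.shape[0]
    t = torch.randint(0, len(alphas), (batch,))      # random timesteps
    a_t = alphas[t].view(-1, 1, 1, 1)
    s_t = sigmas[t].view(-1, 1, 1, 1)

    eps = torch.randn_like(x0)                       # epsilon ~ N(0, I)
    x_t = a_t * x0 + s_t * eps                       # forward (noising) process

    eps_pred = unet(x_t, t, text_emb)                # reverse-process noise prediction
    per_sample = ((eps_pred - eps) ** 2).flatten(1).mean(dim=1)
    return (weights[t] * per_sample).mean()          # omega(t)-weighted MSE
```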

III. METHODOLOGY
Diffusion models demonstrate extraordinary zero-shot capabilities in generating diverse images from textual descriptions. Fig. 1 demonstrates the ability of diffusion models to create multi-angular images using textual prompts.
"Cute dog's front" "Cute dog's side" "Cute dog's back" "The front of a car" "The back of a car" "The side of a car" Pre-trained diffusion models, having been trained with a vast array of internet data, have acquired an understanding of the distribution of images of most objects from various viewpoints [63].By leveraging the geometric priors learned from natural images by large-scale diffusion models and integrating viewpoint control, fine-tuning these pre-trained models enables the generation of images from different perspectives.The viewpoint-conditioned diffusion models (Zero-1-to-3) [63] learn the relative control of camera perspectives using synthetic datasets, thereby facilitating the creation of novel views of the same object under specified camera transformations.Fig. 2. demonstrates the capability of the viewpoint-conditioned diffusion models to take a single-perspective image as input and generate images from diverse viewpoints.The specific steps for using diffusion models as a prior to guide the generation of 3D digital content are as follows: First, initialize a 3D model, then continuously modify the shape of the 3D model according to the prompt.Upon completion of the iterative process, the final 3D model, when rendered from any perspective, aligns consistently with the content described in the prompt.

A. Text-to-3D
The work on generating 3D digital content from textual prompts is built upon the foundations of text-to-image diffusion models [8], [26], [27], [28]. Given that the end product of diffusion models is an image, it is not feasible to directly use the results of diffusion models to supervise the generation of 3D digital content. However, it is possible to utilize the denoising process to guide this generation. The forward process of the diffusion model adds noise to the original data $x_0$ at timestep $t$, resulting in a noised image $x_t = \alpha_t x_0 + \sigma_t \epsilon$. During the reverse process, the noise prediction network estimates the noise $\epsilon$ added at each step, so the denoised image can be expressed as $\hat{x}_0 = \big(x_t - \sigma_t\, \epsilon_\phi(x_t, t)\big)/\alpha_t$. This indicates that as long as the noise prediction is sufficiently accurate, the final image generated from Gaussian noise will also be accurate.
DreamFusion [29] introduces Score Distillation Sampling (SDS) to distill the prior of a pre-trained text-to-image diffusion model into a 3D representation. An image $x = g(\theta, c)$ is rendered from the 3D scene parameters $\theta$ at a randomly sampled camera pose $c$ and perturbed into a noisy image $x_t = \alpha_t x + \sigma_t \epsilon$ [30]. The pre-trained diffusion model predicts the sampling noise $\epsilon_\phi(x_t; y, t)$ given the noisy image $x_t$, the noise timestep $t$, and the text embedding $y$. It provides a gradient direction to update the 3D volumetric parameters $\theta$, with the overall gradient computed by the SDS function
$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, x = g(\theta, c)) = \mathbb{E}_{t,\, \epsilon}\Big[\, \omega(t)\, \big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \,\Big].$$
Here, $\omega(t)$ is a weighting function. The scene model $G$ and the diffusion model $\phi$ can be considered as modular components. It can be demonstrated that this loss fundamentally measures the similarity between the rendered images and textual prompts [40]. During the iterative process, the SDS loss backpropagates only to update the NeRF parameters $\theta$, without altering the pre-trained diffusion model. As iterations progress, the 3D object gradually exhibits textures and geometric shapes that align with the textual prompt. The overall network architecture is succinctly illustrated in Fig. 3.
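In practice, the SDS gradient is usually injected directly onto the rendered image so that only the 3D parameters receive gradients; the sketch below illustrates this pattern with a hypothetical `diffusion.predict_noise` interface (an optimizer step over the 3D parameters would follow each call).

```python
import torch

def sds_backward(render_fn, diffusion, text_emb, alphas, sigmas, weights):
    """One SDS backward pass: render, add noise, query the frozen prior, backpropagate to theta."""
    x = render_fn()                                     # differentiable rendering g(theta, c)
    t = torch.randint(0, len(alphas), (1,)).item()      # random noise timestep

    eps = torch.randn_like(x)
    x_t = alphas[t] * x + sigmas[t] * eps               # noised rendering

    with torch.no_grad():                               # the diffusion prior stays frozen
        eps_pred = diffusion.predict_noise(x_t, t, text_emb)   # hypothetical interface

    # Skip the U-Net Jacobian: inject omega(t) * (eps_pred - eps) as the gradient of x,
    # so backpropagation only updates the 3D parameters theta behind render_fn.
    x.backward(gradient=weights[t] * (eps_pred - eps))
```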

B. Image-to-3D
People possess the ability to envision the 3D structure of an object from a single image, a skill largely derived from the vast amount of prior knowledge accumulated through life experience. Much of the past research has focused on reconstructing 3D models from multi-angle images [56], [57], [64], [65]. This approach is intuitive, as multiple viewpoints are essential for acquiring 3D information. However, 3D reconstruction from multi-angle images remains inefficient. This method requires the collection and acquisition of images from multiple angles, implying that it can only reconstruct objects that already exist in the real world. An interesting aspect is that in industries with a high demand for 3D digital assets, such as gaming, virtual reality, and animation, the focus lies on innovative 3D models rather than mere reproductions of the real world. Typically, the creation of an original 3D model involves numerous steps, as illustrated in Fig. 4. A promising approach to creating the requisite 3D models is through the generation of corresponding 3D models from a single image. While achieving controllability in diffusion models is a hot topic in further research [66], [67], [68], [69], there is still no effective means to precisely control the images they generate. Consequently, the 3D models produced using text-to-3D methods may not always meet specific requirements. In other words, when inputting text prompts, no one can predict the structure of the 3D model until the result is generated. At this juncture, the task of image-to-3D conversion gains a significant advantage.
DreamFusion [29] achieves a text-to-3D generation method based on diffusion priors, demonstrating the exceptional capability of using diffusion priors to optimize NeRF. Related work [38], [41], [70] attempts to apply diffusion priors to single-image 3D generation. Owing to the fact that pre-trained diffusion models are primed with textual prompts, the approach for image-to-3D tasks diverges from that of text-to-3D tasks. Specifically, image-to-3D requires a process of textual inversion [68], differentiating it from the generation method used in text-to-3D tasks. A simplified network structure is illustrated in Fig. 5. The generation of 3D digital content from a single image is typically a two-stage process. The primary task of the coarse stage is to establish the model's basic outline, followed by refinement in the refine stage. Specifically, the coarse stage begins with preprocessing such as background removal [71], textual inversion [68], and depth estimation [72], [73] of the reference image. Background removal focuses on isolating the main object for modeling, while textual inversion generates corresponding textual descriptions to guide the diffusion prior. Depth estimation provides a prior for depth information, supervising subsequent model generation. The overall process starts with initializing a 3D model, rendering images from random angles with added Gaussian noise, and then using a diffusion model to optimize the 3D model through backpropagation using the SDS loss and a series of reference image losses.

1) Reference view reconstruction loss:
To ensure consistency between the images $G_\theta(c)$ rendered from the reference viewpoint $c$ and the reference image $x_0$ itself, a reference view reconstruction loss is typically introduced at the reference viewpoint. This involves applying a Mean Squared Error (MSE) loss to the reference image and its mask:
$$\mathcal{L}_{\mathrm{ref}} = \lambda_{\mathrm{rgb}}\, \big\| M \odot \big( x_0 - G_\theta(c) \big) \big\|_2^2 + \lambda_{\mathrm{mask}}\, \big\| M - M\big(G_\theta(c)\big) \big\|_2^2.$$
Here, $\theta$ represents the parameters of the 3D object being optimized, $\odot$ denotes the Hadamard product, $M$ is the foreground mask of the reference image, and $M(\cdot)$ is the foreground mask of a rendering, obtained by accumulating the volume density along the ray of each pixel. $\lambda_{\mathrm{rgb}}$ and $\lambda_{\mathrm{mask}}$ are the weights for the foreground RGB term and the mask term, respectively [38].
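As a minimal sketch of the loss above (PyTorch-style, with illustrative tensor names for the rendered RGB image, the rendered foreground mask, and the reference image and its mask):

```python
import torch

def reference_view_loss(rendered_rgb, rendered_mask, ref_rgb, ref_mask,
                        lambda_rgb=1.0, lambda_mask=1.0):
    """MSE between the rendering at the reference pose and the reference image/mask."""
    rgb_term = ((ref_mask * (ref_rgb - rendered_rgb)) ** 2).mean()   # masked foreground RGB error
    mask_term = ((ref_mask - rendered_mask) ** 2).mean()             # silhouette error
    return lambda_rgb * rgb_term + lambda_mask * mask_term
```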
2) Depth prior: At reference viewpoints, relying solely on the reference view reconstruction loss may result in poor geometric shapes. To address shape blur, indentations, and flatness, a depth prior is typically incorporated. Specifically, this involves using a pre-trained monocular depth estimator [72] to estimate the depth $d$ of the reference image. The depth of the 3D content viewed from the reference viewpoint should closely match this depth prior. Generally, the negative Pearson correlation is used for depth regularization:
$$\mathcal{L}_{\mathrm{depth}} = -\,\frac{\operatorname{Cov}\big(d(c),\, d\big)}{\operatorname{Var}\big(d(c)\big)\, \operatorname{Var}\big(d\big)}.$$
Here, $\operatorname{Cov}(\cdot)$ denotes covariance, $\operatorname{Var}(\cdot)$ denotes the standard deviation, and $d(c)$ refers to the depth rendered from the modeled 3D content at the reference viewpoint. Through the use of the reference view reconstruction loss and the depth prior, the alignment between the reference image and the 3D model at the reference viewpoint can be optimized as much as possible. Although the estimated depth may not accurately represent geometric details, it is sufficient to ensure a reasonable geometric shape and resolve most ambiguities [40]. Furthermore, a normal smoothness loss [38] and a diffusion CLIP loss [40] can also be added.
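The negative-Pearson regularizer can be written in a few lines; the sketch below assumes the rendered depth d(c) and the monocular estimate d are given as flattened foreground pixel tensors (names are illustrative):

```python
import torch

def depth_prior_loss(rendered_depth, estimated_depth, eps=1e-8):
    """Negative Pearson correlation between rendered and monocularly estimated depth."""
    r = rendered_depth - rendered_depth.mean()
    e = estimated_depth - estimated_depth.mean()
    cov = (r * e).mean()
    return -cov / (r.std() * e.std() + eps)
```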

3) Diffusion prior:
The supervision of novel view generation is guided by a diffusion prior. Textual inversion is used to generate textual descriptions $y$ for the reference image. The SDS loss is employed for the continuous optimization of the 3D model.
The reference view loss includes details not captured by textual prompts, and the SDS loss ensures the generated 3D model conforms to the object's expected shape. Combined, they ensure the model generation is faithful both to the reference image and to the textual prompts.
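Putting the pieces together, one coarse-stage iteration might look like the sketch below, which reuses the reference, depth, and SDS routines sketched above; the `model.render` and `sample_random_pose` interfaces are illustrative assumptions rather than the API of any specific framework.

```python
def coarse_stage_step(model, optimizer, diffusion, text_emb,
                      ref_image, ref_mask, ref_depth, ref_pose,
                      sample_random_pose, alphas, sigmas, weights):
    """One optimization step combining reference-view losses with the diffusion (SDS) prior."""
    optimizer.zero_grad()

    # Reference-view supervision: match the input image, its mask, and the estimated depth.
    rgb, mask, depth = model.render(ref_pose)
    ref_loss = reference_view_loss(rgb, mask, ref_image, ref_mask) \
               + depth_prior_loss(depth, ref_depth)
    ref_loss.backward()

    # Novel-view supervision: SDS gradients from the frozen diffusion prior.
    pose = sample_random_pose()
    sds_backward(lambda: model.render(pose)[0], diffusion, text_emb,
                 alphas, sigmas, weights)

    optimizer.step()
```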
Upon completion of the coarse stage, the generated 3D model possesses a reasonable geometric shape, yet its overall geometric structure and texture remain somewhat rough. Based on the 3D model produced in the coarse stage, a model refinement network [76] can be utilized for further refinement, enhancing its geometric structure and texture. The overall optimization process is fundamentally similar to that of the coarse stage.

IV. EXPERIMENTS
In accordance with the primary research focus of this paper, we categorize the current frameworks for 3D digital content generation based on diffusion models into two distinct types: text-to-3D and image-to-3D. All experimental results were obtained using a single A40 GPU. Our analysis primarily concentrates on two key aspects: the quality of the generated content and the speed of generation.

A. Text-to-3D
In the comparative experiments of text-to-3D digital content generation, we encountered frameworks that were either open-source or proprietary. For the open-source frameworks, experiments were conducted using the original code from the respective papers. In the case of proprietary frameworks, we uniformly utilized threestudio [77] for experimentation. We acknowledge that there might be slight deviations in the results generated by threestudio compared to the original outcomes; however, we believe these differences do not significantly impact our evaluative conclusions. Additionally, in the realm of text-to-3D digital content generation, there are no universally accepted benchmarks for performance evaluation. Consequently, qualitative assessments were primarily based on visual inspections conducted by human observers. In our detailed experiments, we compare recent methods (DreamFusion [29], Latent-NeRF [74], Score Jacobian Chaining [34], ProlificDreamer [75], DreamGaussian [35]) for generating 3D objects from a textual prompt. Furthermore, considering the influence of textual prompt types on the model's generative performance, we employed two categories of textual descriptions: reality-based and imagination-based. The results of the generation are illustrated in Fig. 6.
Through a comparative analysis of the generated mesh quality and the overall generation time, as detailed in Table II, we observed that for objects existing in reality, ProlificDreamer [75] exhibits the highest quality of generation, albeit at the slowest speed. While DreamGaussian [35] may not match the former in terms of quality, it outperforms in generation speed. For imaginary objects, current mainstream frameworks struggle to achieve high-quality generation. We propose two avenues for optimization: firstly, refining textual prompts to more intricately describe the content envisioned, which could enhance the resultant generation; secondly, augmenting the capabilities of the diffusion model by training it with larger datasets.
ProlificDreamer [75] proposed the use of Variational Score Distillation (VSD) to address issues such as over-saturation, over-smoothing, and low diversity in the SDS loss. The core concept involves sampling within the distribution of 3D scenes, representing the 3D distribution with 3D parameter particles. A gradient-based particle updating rule is derived based on Wasserstein gradient flow. Despite its ability to achieve high-quality generation results, ProlificDreamer's method requires alternating training between LoRA [78] and NeRF during the training process, leading to prolonged training times. In contrast, DreamGaussian [35] employs 3D Gaussian Splatting [52] as its 3D representation, which substantially shortens the generation time at some cost in quality.

B. Image-to-3D
In the comparative experiments for image-to-3D digital content generation tasks, we utilized the RealFusion [41] dataset, comprising 15 distinct objects, for our analysis. We compare recent methods (Zero-1-to-3 [63], Magic123 [38], DreamGaussian [35], Stable Zero123) for generating 3D objects from a single unposed image, with specific experimental results depicted in Fig. 7. Unlike the generation of 3D digital content from textual prompts, the quality of 3D content generated from a single image can be assessed based on image-related metrics.
1) PSNR: PSNR is a widely used standard for quantifying the quality of image reconstruction or image compression. It measures the pixel-level differences between the original and the compressed or reconstructed image. PSNR is calculated based on the Mean Squared Error (MSE) between the two images. Generally, a higher PSNR value indicates that the reconstructed image is closer in quality to the original image. It primarily evaluates the pixel-level similarity between the reconstructed or compressed image and the original image, but it may not always align with human perceptual differences.
2) LPIPS: LPIPS is a more modern, deep learning-based metric used to assess the perceptual quality and similarity of images. LPIPS calculates the similarity by comparing the activations of a deep neural network when processing two images. This approach aims to more closely resemble the human visual perception system. LPIPS is used to evaluate the perceptual similarity of images, especially in cases where pixel-level metrics may not capture all aspects of human perception.
3) CLIP-Similarity: CLIP-Similarity is a metric used to evaluate the semantic similarity between images, based on features extracted by the CLIP model. Unlike traditional image similarity metrics that focus on pixel-level details, CLIP-Similarity measures how semantically or contextually similar two images are. CLIP-Similarity is particularly useful when the evaluation criteria extend beyond mere visual or pixel-level accuracy and venture into the realm of contextual and conceptual alignment.
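For concreteness, the sketch below shows one common way to compute the three metrics with widely used libraries (the `lpips` package and a CLIP model loaded through Hugging Face `transformers`); exact scores depend on the chosen backbones and preprocessing, so this is an illustration rather than the evaluation code used in this paper.

```python
import torch
import lpips                                   # pip install lpips
from transformers import CLIPModel, CLIPProcessor

def psnr(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """PSNR for images with values in [0, 1]."""
    mse = torch.mean((img_a - img_b) ** 2)
    return float(10.0 * torch.log10(1.0 / mse))

# LPIPS: perceptual distance from deep-network activations (inputs in [-1, 1], NCHW).
lpips_fn = lpips.LPIPS(net="alex")
def lpips_distance(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    return float(lpips_fn(img_a * 2 - 1, img_b * 2 - 1))

# CLIP-Similarity: cosine similarity of CLIP image embeddings (semantic agreement).
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def clip_similarity(pil_img_a, pil_img_b) -> float:
    inputs = clip_proc(images=[pil_img_a, pil_img_b], return_tensors="pt")
    with torch.no_grad():
        feats = clip_model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```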
For evaluating the quality of generated 3D content from reference viewpoints, we follow the metrics used in previous studies [41], [70]. We employed the PSNR and LPIPS [79] metrics to compare the rendered images against the reference images, thereby assessing the generation quality from reference viewpoints. For images rendered from novel viewpoints, the quality was evaluated using CLIP-Similarity [9], as presented in Table III. Moreover, because we preprocess the original images in the process of generating 3D digital content from images, we applied the same treatments to the rendered images of the final 3D models during comparisons to ensure the accuracy of experimental results.
Our findings reveal that DreamGaussian [35] exhibits the fastest generation speed and achieves the highest quality when viewed from the reference perspective. However, it is noteworthy that its performance in generating novel views is comparatively inferior. On the other hand, Magic123 [38] demonstrates superior performance in generating high-quality novel views by incorporating a dual prior in both 2D and 3D dimensions. Simultaneously, the experimental results also confirm that the combination of diffusion models and 3D Gaussian Splatting [52] can achieve rapid 3D digital content generation, although there is room for further improvement in generation quality.

V. DISCUSSION
This study analyzes the frameworks related to text-to-3D content generation and image-to-3D content generation based on diffusion models, conducting extensive experiments. Through experimental comparative analysis, we identified numerous challenges in 3D content generation based on diffusion models.
A. Current Issues
1) Janus problem: Because the primary approach uses the diffusion model to supervise images rendered from individual perspectives, which in turn direct the generation of the 3D model, the Janus problem is pervasive in 3D digital content generation based on diffusion models.
2) Over-Saturation: Using the SDS loss in the generation of 3D content leads to issues such as over-saturation, over-smoothing, and low diversity.
3) Controllability: Relying solely on textual cues, it is challenging to achieve precise control over 3D content generated from text prompts.
4) Editability: Currently, there is no effective means to edit generated 3D content through artificial intelligence.
5) Imagination: Despite effective generation for real-world objects, the diffusion model struggles with the 3D reasoning and imagination capabilities required for generating novel objects.
6) Primary view dependency: Tasks involving the generation of 3D content from a single image often require the input to be the primary view of the target object.

7) Evaluation metrics:
There is currently no unified evaluation system for assessing the quality of generated 3D content.
8) Generation quality: Diffusion model-based 3D object generation faces issues of insufficient generation quality, resulting in objects that may lack realism or exhibit insufficient detail.
9) Shape inconsistency: Generated 3D objects may exhibit shape inconsistencies, particularly with complex geometric structures or topological relationships.
10) Scale disparities: Current 3D content generation models struggle to effectively handle objects of varying scales and are unable to generate 3D models of different sizes based on specific requirements.

B. Potential Solutions to Some Issues
1) Janus problem: To address the Janus problem, employing multi-view [80] or 3D perception [37] diffusion models can help alleviate the issue. Additionally, an incremental modeling approach, similar to a "humanoid printer", can be applied, generating 3D models for partial views gradually.
2) Over-Saturation: An approach akin to that proposed by ProlificDreamer [75], employing Variational Score Distillation (VSD), can be adopted to address the issue of over-saturation and further enhance the quality of generated 3D models. However, it is noteworthy that this method may lead to a reduction in efficiency.
3) Controllability: While achieving controllability in text-to-3D content generation tasks remains challenging, leveraging image-to-3D generation tasks can facilitate more controlled 3D content generation.
4) Editability: Editing of 3D content can be achieved through image editing techniques [66] or by combining ChatGPT [1] to map text or voice into a latent space for effective editing.

5) Imagination:
To improve the generation performance of models, richer semantic descriptions can be employed. Alternatively, a more powerful diffusion model can be trained on a larger dataset. These strategies aim to enhance the overall effectiveness of the model in generating high-quality outputs.
6) Primary view dependency: The capabilities of novel view synthesis models can be further enhanced so that primary views of objects can be generated from arbitrary input images.

VI. CONCLUSION
With the continuous development of generative artificial intelligence, the scope of generated content is expanding beyond text, audio, and image domains, gradually progressing towards the generation of 3D objects and environments. Fueled by the visions of virtual reality, augmented reality, and the metaverse, the demand for 3D digital content across various industries is expected to further burgeon.
Current research indicates that different frameworks for 3D digital content generation exhibit advantages and limitations in terms of both generation quality and efficiency. Through our specific investigations, we posit that the integration of diffusion models and 3D Gaussian Splatting will be a focal point in the future research of 3D digital content generation. Additionally, constrained by the controllability issue in text-to-3D, a viable workflow for 3D digital content generation is as follows: firstly, generate images from text, providing creators with creative input. Subsequently, employ artificial intelligence to optimize and edit the image content to achieve the desired appearance. Then, use an image-to-3D generation framework to create a 3D model. Finally, import the generated 3D model into 3D modeling software for further refinement.
With the advancement of 3D object generation frameworks, future research is expected to extend from individual objects to scene generation. How to integrate procedural scene generation with artificial intelligence in the future is a question worthy of consideration.
In summary, this review comprehensively elucidates how diffusion models can be leveraged for 3D digital content generation. We analyze key frameworks for 3D digital content generation and experimentally validate the efficiency and feasibility of combining diffusion models with 3D Gaussian Splatting for modeling. We summarize the existing challenges in 3D digital content generation based on diffusion models and propose potential solutions for some of these issues. Overall, we contend that image-to-3D digital content generation aligns more closely with societal applications, though we remain optimistic about the future of text-to-3D digital content generation.

Fig. 2. Generate different perspective images from a single viewpoint image.

Fig. 3. A simplified framework for generating 3D digital content based on text prompts.

Fig. 5. A two-stage framework for generating 3D digital content from a single image using diffusion priors.

Fig. 7. Qualitative comparisons of 3D digital content generation from a single image.

TABLE I. COMPARISON OF 3D DATASETS AND 2D DATASETS

TABLE II. MULTI-PERSPECTIVE COMPARATIVE ASSESSMENT OF TEXT-TO-3D DIGITAL CONTENT GENERATION FRAMEWORKS

TABLE III. QUANTITATIVE RESULTS ARE PROVIDED FOR PSNR ↑, LPIPS ↓, AND CLIP-SIMILARITY ↑