2.1. Digital Image Forensics
The widespread availability of image editing software with increasingly powerful capabilities amplifies the challenge of image forgery, prompting an urgent demand for image forensic techniques. Two primary approaches, active and passive, address the detection of image content forgery or manipulation. Active methods embed imperceptible digital watermarks into images, so that the extracted watermarks, or the conditions in which they are recovered, may reveal manipulations applied to an investigated image. However, their drawbacks include that only watermarked images are protected and that the embedding process slightly alters the image content. Moreover, controversies surround who should be responsible for detecting or extracting watermarks to establish image authenticity.
In contrast, passive methods operate on the premise that even seemingly imperceptible image manipulations alter the statistical characteristics of imagery data. Rather than introducing additional signals into images, passive methods seek to uncover underlying inconsistencies to detect image manipulations or forgery [1,2]. Among passive methods, camera model recognition [3] is a promising direction in image forensics, which involves determining the type of camera used to capture the images. Various recognition methods utilizing features such as CFA demosaicing effects [4], sensor noise patterns [5], local binary patterns [6], noise models [7], chromatic aberrations [8], and illumination direction [9] have been proposed. Classification based on these features helps to determine the camera model type. Recent work has utilized deep learning [10] to pursue generality.
Despite the effectiveness demonstrated by camera model identification methods, many of these techniques share a common limitation. They assume that the camera models to be determined are part of a predefined dataset, requiring that these models be included in the training data to precisely recognize images captured by specific cameras. However, selecting a suitable set of camera models is not a trivial issue, and expanding the training dataset to encompass all camera types is impractical, given that the number of camera types continues to grow over time. In many investigative contexts, identifying the exact camera model that captured an image is not necessary; the primary objective is to confirm whether a set of examined images comes from the same camera model, in order to expose intellectual property infringement or reveal image splicing fraud [11]. Bondi et al. [12] discovered that CNN-based camera model identification can be employed for classifying and localizing image splicing operations. Ref. [13] fed features extracted by the network into a Siamese network for further comparison. Building upon these findings, the proposed UFCC scheme extends the ideas presented in [13] to develop a unified approach for detecting manipulated image regions and determining the authenticity of videos.
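The comparison idea behind [13] can be illustrated with a minimal sketch. Everything below is hypothetical: the hand-crafted summary statistics stand in for learned CNN features, and the distance threshold is invented; only the Siamese structure (one shared extractor applied to both patches, followed by a distance comparison) mirrors the cited approach.

```python
import math

def patch_features(patch):
    """Toy stand-in for a CNN feature extractor: summary statistics
    of a 1-D list of pixel values. A real system uses learned features."""
    n = len(patch)
    mean = sum(patch) / n
    var = sum((p - mean) ** 2 for p in patch) / n
    # First-order differences loosely reflect local noise/demosaicing traces.
    diffs = [abs(a - b) for a, b in zip(patch, patch[1:])]
    return [mean, var, sum(diffs) / len(diffs)]

def same_source_score(patch_a, patch_b):
    """Siamese-style comparison: embed both patches with the SAME
    extractor, then measure their distance in feature space."""
    return math.dist(patch_features(patch_a), patch_features(patch_b))

def same_camera_model(patch_a, patch_b, threshold=5.0):
    # Hypothetical threshold; a real system would calibrate it on data.
    return same_source_score(patch_a, patch_b) < threshold
```

A pair of statistically similar patches would score below the threshold, while patches with very different noise characteristics would not.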
2.2. Deepfake
Deepfake operations are frequently utilized for facial manipulation, changing the visual representation of individuals in videos to diverge from the original content or substitute their identities. The spectrum of Deepfake operations has expanded over time and elicited numerous concerns. Next, we analyze the strengths and weaknesses of each Deepfake technique and track the evolution of these approaches.
- 1. Identity swapping
Identity swapping entails substituting faces in a source image with the face of another individual, effectively replacing the facial features of the target person while retaining the original facial expressions. The use of deep learning methods for identity replacement can be traced back to the emergence of Deepfakes in 2017 [14]. Deepfakes employed an autoencoder architecture comprising an encoder–decoder pair, where the encoder extracts latent facial features and the decoder reconstructs the target face. Korshunova et al. [15] utilized a fully convolutional network along with style transfer techniques, and adopted multiple loss functions with variation regularization to generate realistic images. However, these approaches require a substantial amount of data for both the source and target individuals for paired training, making the training process time-consuming.
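The encoder–decoder layout described above can be sketched in a few lines. This toy illustration uses invented linear "networks"; a real Deepfake pipeline trains a shared convolutional encoder and one decoder per identity, then swaps identities by decoding one person's latent code with the other person's decoder.

```python
# Toy illustration of the classic Deepfake autoencoder layout: one shared
# encoder plus one decoder per identity. All "networks" here are invented
# stand-ins; real systems use trained convolutional networks.

def shared_encoder(face):
    # Compress a face (list of pixel values) into a 2-D latent code.
    half = len(face) // 2
    return [sum(face[:half]) / half, sum(face[half:]) / half]

def make_decoder(identity_offset):
    # Each identity gets its own decoder, parameterized here by a single
    # offset that mimics identity-specific appearance.
    def decoder(latent):
        return [latent[0] + identity_offset, latent[1] + identity_offset]
    return decoder

decoder_a = make_decoder(identity_offset=0.0)   # person A
decoder_b = make_decoder(identity_offset=50.0)  # person B

def face_swap(source_face, target_decoder):
    """Swap: encode the source frame, then decode with the TARGET
    identity's decoder, keeping expression but changing identity."""
    return target_decoder(shared_encoder(source_face))
```

Because the encoder is shared, the latent code captures expression and pose, while the choice of decoder determines whose face is rendered.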
In an effort to enhance the efficiency of Deepfakes, Zakharov et al. [16] proposed GAN-based few-shot or one-shot learning to generate realistic talking-head videos from images. Various studies have focused on extensive meta-learning using large-scale video datasets over extended periods. Additionally, self-supervised learning methods and the generation of forged identities based on independently encoded facial features and annotations have been explored. Zhu et al. [17] extended the latent spaces to preserve more facial details and employed StyleGAN2 [18] to generate high-resolution swapped facial images.
- 2. Expression reenactment
Expression reenactment involves transferring the facial expressions, gestures, and head movements of the source person onto the target person while preserving the identity of the target individual. These operations aim to modify facial expressions while synchronizing lip movements to create fictional content. Techniques such as 3D face reconstruction and GAN architectures have been employed to capture head geometry and motions. Thies et al. [19] introduced 3D facial modeling combined with image rendering, allowing for the real-time transfer of spoken expressions captured by a regular web camera to the face of the target person.
While GAN-based methods can generate realistic images, achieving highly convincing reenactment for unknown identities requires substantial training data. Kim et al. [20] proposed fusing spatial–temporal encoding and conditional GANs (cGANs) on static images to synthesize target video avatars, incorporating head poses, facial expressions, and eye movements, resulting in highly realistic scenes. Other research has explored fully unsupervised methods utilizing dual cGANs to train emotion–action units for generating facial animations from single images.
Recent advancements include few-shot or one-shot facial expression reenactment techniques, alleviating the training burden of large-scale datasets. These approaches adopt strategies such as image attention analysis, target feature alignment, and landmark transformation to prevent quality degradation due to limited or mismatched data. Such methods eliminate the need for additional identity-adaptive fine-tuning, making them suitable for practical Deepfake applications. Fried et al. [21] devised content-based editing methods to fabricate forged speech videos, modifying speakers’ head movements to match the dialogue content.
- 3. Face synthesis
Face synthesis primarily revolves around the creation of entirely new facial images and finds applications in diverse domains such as video games and 3D modeling. Many face synthesis methods leverage GAN models to enhance resolution, image quality, and realism. StyleGAN [22] raised the achievable image resolution beyond that of its predecessor ProGAN [23], and subsequent improvements were made with StyleGAN2 [18], which effectively eliminated artifacts to further enhance image quality.
The applications of GAN-based facial synthesis methods are varied, encompassing facial attribute translation [22,24], the combination of identity and attributes, and the removal of specific features. Some synthesis techniques extend to virtual makeup trials, enabling consumers to virtually test cosmetics [25] without the need for physical samples or in-person visits. Additionally, these methods can be applied to synthesis operations involving the entire body; for instance, DeepNude utilized the Pix2PixHD GAN model [26] to patch clothing areas and generate fabricated nude images.
- 4. Facial attribute manipulation
Facial attribute manipulation, also known as facial editing or modification, entails modifying facial attributes such as hair color, hairstyle, skin tone, gender, age, smile, glasses, and makeup. These operations can be considered a form of conditional partial face synthesis. GAN methods, commonly used for face synthesis, are also employed for facial attribute manipulation. Choi et al. [24] introduced a unified model that simultaneously trains on multiple datasets with distinct regions of interest, allowing for the transfer of various facial attributes and expressions. This approach eliminates the need for additional cross-domain models for each attribute. Extracted facial features can be analyzed in different latent spaces, providing more precise control over attribute manipulation within facial editing. However, it is worth noting that performance may degrade when dealing with occluded faces or when the face lies outside the expected range.
- 5. Hybrid approaches
A potential trend is emerging wherein different Deepfake techniques are amalgamated to form hybrid approaches, rendering them more challenging to identify. Nirkin et al. [27] introduced a GAN model for real-time face swapping that fuses reenactment and synthesis. Some approaches employ two separate Variational Autoencoders (VAEs) to convert facial features into latent vectors, which are then conditionally adjusted for the target identity. These methods facilitate the application of multiple operations to any combination of two faces without requiring retraining, so users can freely swap faces and modify facial parameters, including age, gender, smile, hairstyle, and more. Even more sophisticated methods exist; Ref. [28] adopted a multimodal fusion approach to deal with fake news, providing further protection mechanisms for social media platforms.
2.3. Deepfake Detection
In response to the threat posed by Deepfakes, several potential solutions have been proposed to help identify whether an examined video has undergone Deepfake operations.
- 1. Frame-level detection
Traditional Deepfake classifiers are typically trained directly on both real and manipulated images or video frames, employing methods such as dual-stream neural networks [29], MesoNet [30], CapsuleNet [31], and Xception-Net [32]. However, with the increasing diversity of recent Deepfake methods, direct training on images or video frames is deemed insufficient to handle the wide variety of manipulation techniques. It has been noted that certain details in manipulated images or video frames can still exhibit flaws, including decreased image quality, abnormal content deformations, or fragmented edges. Some methods have leveraged the inconsistencies arising from imperfections in Deepfake images/videos to analyze biological cues such as blinking, head poses, skin textures, iris patterns, and teeth colors. Additionally, abnormalities in facial deformation or mouth movements may be utilized to distinguish between real and fake instances. Other methods decompose facial images to extract details and combine them with the original face to identify key clues. Frequency-domain approaches have also been explored: Li et al. [33] developed an adaptive feature extraction module to enhance separability in the embedding space, and Liu et al. [34] combined spatial images and phase spectra to address under-sampling artifacts and enhance detection effectiveness.
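As a concrete illustration of the frequency-domain cues mentioned above, the phase spectrum of an image patch can be computed with a plain 2-D DFT. This is a naive stdlib-only sketch for tiny patches; practical detectors such as [34] operate on full images with FFTs and feed the phase information into a learned classifier.

```python
import cmath

def dft2_phase(patch):
    """Naive 2-D DFT phase spectrum of a small grayscale patch
    (list of rows). Illustrates extracting phase, not just magnitude,
    from the frequency domain; O(N^4), so for demonstration only."""
    h, w = len(patch), len(patch[0])
    phases = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for x in range(h):
                for y in range(w):
                    angle = -2j * cmath.pi * (u * x / h + v * y / w)
                    acc += patch[x][y] * cmath.exp(angle)
            phases[u][v] = cmath.phase(acc)
    return phases
```

Upsampling and blending steps in synthesis pipelines tend to leave statistical regularities in such spectra that a classifier can pick up.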
Attention and segmentation are common strategies in Deepfake classifiers. For instance, attention mechanisms may be employed to highlight key regions when forming improved feature maps for classification, or to focus on blended regions to identify manipulated faces without relying on specific facial operations. Nguyen et al. [35] designed a multi-task learning network to simultaneously classify forged facial images and locate the manipulated areas.
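A minimal sketch of the attention idea: per-region relevance scores are normalized into weights that re-emphasize suspicious regions when pooling features. The scores and features below are invented placeholders for quantities a network would learn.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(region_features, region_scores):
    """Toy spatial attention: turn per-region relevance scores into
    weights and pool region feature vectors accordingly, so regions
    likely to contain blending artifacts dominate the descriptor."""
    weights = softmax(region_scores)
    dim = len(region_features[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, region_features):
        for i in range(dim):
            pooled[i] += w * feat[i]
    return pooled
```

With uniform scores this reduces to average pooling; a high score on one region shifts the pooled descriptor toward that region's features.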
- 2. Video-level detection
The objective here is to ascertain whether an entire video contains manipulated content rather than merely detecting individual frames. These methods typically utilize the temporal information and dynamic features of the video to distinguish between authentic and manipulated content. Approaches relying on temporal consistency analyze the temporal relationships and motion patterns among video frames. Güera et al. [36] employed a blend of CNNs and recurrent neural networks (RNNs), where CNNs extract frame features and RNNs identify temporal inconsistencies. Motion-feature-based methods examine motion patterns and dynamic features within videos, as real and manipulated videos may display distinct motion behaviors; analyzing motion patterns, optical flows, and motion consistency helps to identify abnormal patterns in manipulated videos. Refs. [37,38] utilized blinking and head pose changes to discern between real and manipulated content. Spatial-feature-based methods utilize texture structures, spectral features, edge sharpening, and other spatial characteristics of video frames to detect potential synthesis artifacts. Note that video-level Deepfake detection methods may demand more computational resources and execution time. Moreover, due to the continuous evolution of Deepfake techniques, detection methods require regular updates to counter new manipulation attacks.
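The CNN-plus-RNN pipeline of [36] can be caricatured as two stages: a per-frame feature extractor followed by a temporal inconsistency score. In this deliberately simple stand-in, mean intensity replaces CNN features and a maximum frame-to-frame feature jump replaces the RNN; only the two-stage structure reflects the cited design, and the threshold is invented.

```python
def frame_feature(frame):
    """Stand-in for a CNN feature extractor: mean intensity of a frame
    (list of pixel values). Real pipelines use deep features."""
    return sum(frame) / len(frame)

def temporal_inconsistency(frames):
    """Stand-in for the recurrent stage: score how erratically the
    per-frame features evolve. Smooth footage yields low scores;
    abrupt jumps between consecutive frames raise suspicion."""
    feats = [frame_feature(f) for f in frames]
    jumps = [abs(a - b) for a, b in zip(feats, feats[1:])]
    return max(jumps) if jumps else 0.0

def looks_manipulated(frames, threshold=10.0):
    # Hypothetical threshold; a trained model would learn this boundary.
    return temporal_inconsistency(frames) > threshold
```

The key point is that the decision is made over the whole frame sequence, not per frame, which is what distinguishes video-level from frame-level detection.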