Article
Peer-Review Record

UFCC: A Unified Forensic Approach to Locating Tampered Areas in Still Images and Detecting Deepfake Videos by Evaluating Content Consistency

Electronics 2024, 13(4), 804; https://doi.org/10.3390/electronics13040804
by Po-Chyi Su *, Bo-Hong Huang and Tien-Ying Kuo *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 10 January 2024 / Revised: 5 February 2024 / Accepted: 16 February 2024 / Published: 19 February 2024
(This article belongs to the Special Issue Image/Video Processing and Encoding for Contemporary Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. Restructure the abstract to include essential elements: Background of the work, Hypothesis, Methods applied, Obtained results, and Conclusion.

2. The CNN model in Figure 2 is unclear. Redrawing it and clarifying the filter size, input size, and number of parameters is recommended.

3. The datasets used, and how they are used, should be explained.

4. The hyperparameter optimization results should be discussed.

5. The preprocessing stage should be explained: how the videos are converted into frames and how the frames are then input into the proposed model.

6. The references should support all formulas or equations used in this study.

7. The conclusion should contain a brief description of the obtained results.

Comments on the Quality of English Language

Should be revised

Author Response

Dear Reviewer,

We really appreciate your suggestions, which certainly make our paper read much better. Thank you very much.

Sincerely
Pochyi

1. Restructure the abstract to include essential elements: Background of the work, Hypothesis, Methods applied, Obtained results, and Conclusion.
Thank you for the suggestion. The abstract has been revised as follows:
Image inpainting and Deepfake techniques have the potential to drastically alter the meaning of visual content, posing a serious threat to the integrity of both images and videos. Addressing this challenge requires the development of effective methods to verify the authenticity of investigated visual data. This research introduces UFCC (Unified Forensic Scheme by Content Consistency), a novel forensic approach based on deep learning. UFCC can identify tampered areas in images and detect Deepfake videos by examining content consistency, assuming that manipulations can create dissimilarity between tampered and intact portions of visual data. The term "Unified" signifies that the same methodology is applicable to both still images and videos. Recognizing the challenge of collecting a diverse dataset for supervised learning due to various tampering methods, we overcome this limitation by incorporating information from original or unaltered content in the training process rather than relying solely on tampered data. A neural network for feature extraction is trained to classify imagery patches, and a Siamese network measures the similarity between pairs of patches. For still images, tampered areas are identified as patches that deviate from the majority of the investigated image. In the case of Deepfake video detection, the proposed scheme involves locating facial regions and determining authenticity by comparing facial region similarity across consecutive frames. Extensive testing is conducted on publicly available image forensic datasets and Deepfake datasets with various manipulation operations. The experimental results highlight the superior accuracy and stability of the UFCC scheme compared to existing methods.
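To make the patch-consistency idea concrete, the following is a minimal sketch of a Siamese patch-similarity setup in PyTorch. The layer sizes, patch size, and embedding dimension are illustrative assumptions, not the actual UFCC architecture described in the paper.

```python
# Minimal Siamese patch-similarity sketch (PyTorch). All sizes here are
# assumptions for illustration, not the actual UFCC architecture.
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Toy feature extractor mapping an RGB patch to an embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class SiameseSimilarity(nn.Module):
    """One shared encoder scores the consistency of a patch pair."""
    def __init__(self):
        super().__init__()
        self.encoder = PatchEncoder()

    def forward(self, patch_a, patch_b):
        ea, eb = self.encoder(patch_a), self.encoder(patch_b)
        # Cosine similarity in [-1, 1]; low values flag inconsistent content.
        return nn.functional.cosine_similarity(ea, eb, dim=1)

model = SiameseSimilarity()
a = torch.randn(4, 3, 64, 64)  # batch of 64x64 patches (assumed size)
b = torch.randn(4, 3, 64, 64)
print(model(a, b).shape)  # torch.Size([4])
```

In this framing, patches whose similarity to the majority of the image falls below a threshold would be flagged as candidate tampered regions, matching the consistency rationale in the abstract.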

2. The CNN model in Figure 2 is unclear. Redrawing it and clarifying the filter size, input size, and number of parameters is recommended.
Thank you for the suggestion. We redrew Figure 2 and included the resulting data sizes at each step. Some descriptions have been added on Page 7 to make the explanations clearer:
It is worth noting that all convolution layers have a stride of one and use boundary reflection in their convolution operations. This ensures that the width and height of the data remain unchanged; any change in spatial resolution is caused solely by the pooling operations, whether max pooling or average pooling. For a clearer understanding, the resulting dimensions of the data are also provided in Figure 2.
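To illustrate the point, here is a minimal PyTorch sketch (channel counts are assumptions, not those of Figure 2) showing that a stride-one convolution with reflection padding preserves the spatial size, while pooling halves it:

```python
# Stride-1 convolution with reflection padding keeps H and W unchanged;
# only the pooling step alters the spatial resolution.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 32, kernel_size=3, stride=1,
                 padding=1, padding_mode="reflect")
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 3, 64, 64)
y = conv(x)
print(y.shape)        # torch.Size([1, 32, 64, 64]) -- 64x64 preserved
print(pool(y).shape)  # torch.Size([1, 32, 32, 32]) -- halved by pooling
```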

3. The datasets used, and how they are used, should be explained.
Some descriptions of the training datasets have been added in Section 4.2.
When training the feature extractor described in Section 3.2, we compiled various camera types from the VISION dataset [45], the Camera Model Identification Challenge (CMIC) dataset [46], and our own collection. After removing duplicates and conducting thorough tests and comparisons, we identified the 40 most suitable classes (24 from VISION, 8 from CMIC, and 8 from our own collection) for the training dataset. For the similarity network outlined in Section 3.3, we used 25 camera-type classes from the Dresden dataset [47] as training data. Using datasets different from those for the feature extractor lets the model first learn from the 40 camera-type classes and then be fine-tuned on the 25 unseen classes while the similarity network is trained, further strengthening its ability to deal with unknown camera types. It should be noted that all the images in these datasets are used for training, since the images investigated in the experiments are not restricted to datasets containing camera-type information. Moreover, it is not easy for us to collect more images for specific camera types.

4. The hyperparameter optimization results should be discussed.
We discussed the number of frames used in detecting Deepfake videos in Tables 7 and 8. We agree that the proposed scheme contains many hyperparameters, but varying some of their values does not yield significantly different results. Therefore, we simply list the values in the paper as a reference for readers.

5. The preprocessing stage should be explained: how the videos are converted into frames and how the frames are then input into the proposed model.
For videos, we do extract the frames first so that the subsequent investigation can be carried out more easily in the experiments. Although the extracted frames occupy considerable storage space, this process is simpler and more accessible than working directly with video files. In real applications, we suggest using a buffer to extract video frames for examining the frame content.
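For readers who want to reproduce this preprocessing, a minimal frame-extraction sketch with OpenCV is shown below. The paths are placeholders, and this is not necessarily the authors' exact pipeline:

```python
# Hedged sketch: dump a video to individual frame images with OpenCV.
import os
import cv2

def extract_frames(video_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
        idx += 1
    cap.release()
    return idx

# n = extract_frames("input.mp4", "frames/")  # placeholder paths
```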

6. The references should support all formulas or equations used in this study.
The references related to the evaluation metrics have been added.

7. The conclusion should contain a brief description of the obtained results.
Thank you for the reminder. The conclusion has been revised to include the obtained results.
This research introduces a robust image and video tampering detection scheme, UFCC, aimed at identifying manipulated regions in images and verifying video authenticity based on content consistency. The scheme utilizes camera models to train the feature extractor through deep learning, and a Siamese network to assess the similarity between examined patches. Addressing image tampering involves implementing patch selection strategies for approximating manipulated areas, followed by refinement through FBA Matting. In Deepfake video detection, facial regions are initially extracted, and similar detection methods are applied to patches in frames to determine whether the investigated video is a Deepfake forgery.
The UFCC scheme, when compared to existing methods, not only demonstrates superior performance but also distinguishes itself by presenting a comprehensive detection methodology capable of handling both image and video manipulation scenarios. In detecting image inpainting, UFCC surpasses existing work on the DSO-I dataset [48] with higher mAP [66], cIoU [67], MCC [68], and F1 [69] values of 0.58, 0.83, 0.56, and 0.63, respectively. For the detection of Deepfake videos, UFCC excels on the DF [12], F2F [17], and NT [58] tests with accuracy rates of 0.982, 0.984, and 0.973, respectively. Although the accuracy on the FS [57] test is slightly lower at 0.928, it is worth noting that FS contains less realistic images and is considered a milder threat. Notably, our training data excludes manipulated images and Deepfake datasets, enhancing the proposed scheme's generalization capability. This absence of specific target tampering operations makes the UFCC scheme more flexible and adaptive, enabling it to handle a wide range of content manipulations, including mixed scenarios.
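For reference, the MCC and F1 metrics cited above can be computed from binary confusion counts as in this generic sketch (the example counts are made up for illustration and are unrelated to the reported results):

```python
# Generic MCC and F1 from confusion counts; not the authors' evaluation code.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

print(mcc(tp=80, tn=90, fp=10, fn=20))  # ~0.70 (illustrative)
print(f1(tp=80, fp=10, fn=20))          # ~0.84 (illustrative)
```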

Reviewer 2 Report

Comments and Suggestions for Authors

The authors proposed a deep fake image/video detection method using a content-consistency evaluation. Though the paper is interesting, I have some comments, as follows:

(1) There are many sophisticated recent methods for DeepFake text/image/video detection. The authors should discuss some of them (from top journals) in the literature review. One such paper is "QMFND: A quantum multimodal fusion-based fake news detection model for social media," Information Fusion, vol. 104, 102172, pp. 1-11, April 2024.

(2) It is not clear whether the images are raw images or not. What happens if we use JPEG images? What happens if we change the Q factor of the JPEG images ("Passive copy move image forgery detection using undecimated dyadic wavelet transform," Digital Investigation, Elsevier, vol. 9, issue 1, pp. 49-57, 2012)? What happens if we add noise at different SNRs?

(3) The experiments should be comprehensive, covering, for example, different SNRs and different Q factors.

(4) What is the complexity of the proposed method?

Comments on the Quality of English Language

Moderate corrections are needed.

Author Response

Dear Reviewer,

We really appreciate your constructive suggestions, which certainly make our paper read much better. Thank you very much.

Sincerely
Pochyi

================

The authors proposed a deep fake image/video detection method using a content-consistency evaluation. Though the paper is interesting, I have some comments, as follows:

1. There are many sophisticated recent methods for DeepFake text/image/video detection. The authors should discuss some of them (from top journals) in the literature review. One such paper is "QMFND: A quantum multimodal fusion-based fake news detection model for social media," Information Fusion, vol. 104, 102172, pp. 1-11, April 2024.
Thank you for the reminder. This excellent work has been included in the reference section of the paper.

2. It is not clear whether the images are raw images or not. What happens if we use JPEG images? What happens if we change the Q factor of the JPEG images ("Passive copy move image forgery detection using undecimated dyadic wavelet transform," Digital Investigation, Elsevier, vol. 9, issue 1, pp. 49-57, 2012)? What happens if we add noise at different SNRs?
Right now we use raw images without compression. The scheme may not perform as well when moderate JPEG compression is applied. We leave the strategy for dealing with lossy compression to future work.
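As a hedged illustration of how such a robustness study could be set up in future work, the sketch below re-encodes a test image at several JPEG quality factors with Pillow; `run_detector` is a hypothetical stand-in for the forensic model, not a function from the paper:

```python
# Probe JPEG robustness: re-encode at several quality factors and score each.
import io
from PIL import Image

def jpeg_versions(image_path, qualities=(95, 75, 50)):
    img = Image.open(image_path).convert("RGB")
    for q in qualities:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=q)
        buf.seek(0)
        yield q, Image.open(buf)

# for q, version in jpeg_versions("test.png"):  # placeholder path
#     print(q, run_detector(version))           # hypothetical detector call
```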

3. Experiments should be comprehensive. For example, at different SNRs, and at different Qs.
We used raw video frames in the experiments (Deepfake datasets usually include both raw and encoded frames). The Future Work section discusses approaches that may perform better when dealing with encoded videos.

4. What is the complexity of the proposed method?
The complexity is no higher than that of ordinary deep learning applications. In fact, the complexity of the proposed scheme is not very high, as we did not use a modern GPU (a 2080 Ti in the experiments). In our opinion, investigating the authenticity of imagery data does not require real-time processing, so the complexity issue is not our top priority, and we focus on accuracy in this study. The most time-consuming process in the experiments is dealing with video frames, which is why we decided to extract all the frames rather than work with video files directly.

Reviewer 3 Report

Comments and Suggestions for Authors

In this paper, a deep learning network model is proposed to detect fake content in images and videos. The title of this paper suggests that the algorithm detects fake images and videos. In fact, it does even more: it detects the fake content within images and videos. So, I suggest re-editing the title of this paper.

Another remark is editorial: the caption of Fig. 8 should be: (a) overlapped patches, (b) non-overlapped patches.

In my opinion, this paper is of high quality and is worth publishing. The topic is very important these days.

Comments on the Quality of English Language

I have no particular comments.

Author Response

Dear Reviewer,

We really appreciate your constructive suggestions, which certainly make our paper read much better. Thank you very much.

Sincerely
Pochyi

====================

  1. In this paper, a deep learning network model is proposed to detect fake content in images and videos. The title of this paper suggests that the algorithm detects fake images and videos. In fact, it does even more: it detects the fake content within images and videos. So, I suggest re-editing the title of this paper.
    The title of the paper has been re-edited to “UFCC: A Unified Forensic Approach to Locating Tampered Areas in Still Images and Detecting Deepfake Videos by Evaluating Content Consistency”, which should better cover the topics or issues discussed in the paper.
  2. Another remark is editorial: the caption of Fig. 8 should be: (a) overlapped patches, (b) non-overlapped patches.
    Thank you for the reminder. We have modified the caption accordingly.
  3. In my opinion, this paper is of high quality and is worth publishing. The topic is very important these days.
    Thank you for the compliment. We hope that our work can make some contributions to ensuring the integrity of digital content.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revised manuscript could be accepted.

Reviewer 2 Report

Comments and Suggestions for Authors

I am satisfied with the revision.
