Article
Peer-Review Record

Dynamic Fusion Technology of Mobile Video and 3D GIS: The Example of Smartphone Video

ISPRS Int. J. Geo-Inf. 2023, 12(3), 125; https://doi.org/10.3390/ijgi12030125
by Ge Zhu, Huili Zhang, Yirui Jiang, Juan Lei, Linqing He and Hongwei Li *
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4:
Reviewer 5:
Submission received: 18 December 2022 / Revised: 7 March 2023 / Accepted: 9 March 2023 / Published: 14 March 2023

Round 1

Reviewer 1 Report

The manuscript deals with the problem of linking 3DGIS with video obtained from a smartphone. The development of mobile devices, in combination with GPS, has opened the opportunity to integrate recorded videos into 3DGIS.

- The objectives are clearly defined in the manuscript.

- How does the shape of a building, its articulation, the number of edges, balconies and windows, or curves affect the accuracy of the model?

- Have you considered different types of buildings for comparison, from simple buildings to complex buildings in shape?

- Is it also possible to capture moving objects?

 

Author Response

We greatly appreciate the time you took to provide constructive comments and useful suggestions; they have substantially improved the quality of the manuscript. We have considered each comment and suggestion in depth and detail and have revised the manuscript accordingly. We hope that the resubmitted manuscript meets your expectations.

(Photos cannot be viewed in this window, please see the attachment for details.)

Point 1: How does the shape of a building, its articulation, the number of edges, balconies and windows, or curves affect the accuracy of the model? 

 

Response 1: The method in this paper draws on the principle of shadow mapping. A depth camera is constructed dynamically from the position and pose data associated with each video frame, a shadow is cast within its view frustum, and the depth image and transformation matrix of the frustum are obtained at the point light source. Each fragment captured by the 3DGIS window camera is then transformed back into the depth-camera space for a depth comparison, and the video-frame texture is projected onto the 3DGIS model by texture mapping. In effect, the method replaces the texture on the 2D canvas after browser rendering. Therefore, the shape details of a building affect only the accuracy of the 3D model itself, and the fusion accuracy of the projection is affected by model error only indirectly. The factors that directly affect the accuracy of the projection are sensor error, pose-resolution error, lens-distortion error and timing-correspondence error. The impact of model accuracy is partially discussed in the discussion section of the new Chapter 4, as follows:

 

‘Notably, lens distortion is not discussed in this paper. This is because the sensor error and attitude-resolution error have a greater impact on the accuracy of the fusion effect, while the timing-correspondence error and lens distortion have a smaller negative impact. The impact of timing-correspondence error and lens distortion on the fusion effect will not be considered until the larger errors have been reasonably optimized. Therefore, follow-up work on accuracy should focus on optimizing the cell phone sensor error and attitude-resolution error, while in terms of application the focus should be on fusing real-time video captured by cell phones with 3DGIS.’
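To illustrate the shadow-mapping-style depth test and projective texture lookup described in Response 1, a minimal sketch follows (a GLSL fragment shader embedded as a TypeScript string, written as if applied while re-rendering the model geometry). The uniform and varying names are illustrative assumptions and are not taken from the paper's implementation.

```typescript
// Minimal GLSL ES fragment shader, embedded as a TypeScript string, sketching the
// shadow-mapping-style depth test plus projective video-texture mapping described above.
// All uniform/varying names are illustrative, not the authors' implementation.
const projectVideoFrameFS = /* glsl */ `
  precision highp float;

  uniform sampler2D u_depthImage;     // depth image rendered from the dynamic depth camera
  uniform sampler2D u_videoFrame;     // current smartphone video frame
  uniform mat4      u_depthCameraVP;  // view-projection matrix of the depth camera
  varying vec4      v_worldPosition;  // world-space position of the fragment

  void main() {
    // Re-project the fragment into the depth camera's clip space.
    vec4 clip = u_depthCameraVP * v_worldPosition;
    if (clip.w <= 0.0) {
      discard;                         // behind the depth camera
    }
    vec3 ndc = clip.xyz / clip.w;      // normalized device coordinates
    vec2 uv  = ndc.xy * 0.5 + 0.5;     // [0, 1] texture coordinates

    // Fragments outside the video camera's frustum keep the already-rendered model texture.
    if (uv.x < 0.0 || uv.x > 1.0 || uv.y < 0.0 || uv.y > 1.0) {
      discard;
    }

    // Depth test: is this fragment the closest surface seen by the depth camera?
    float storedDepth = texture2D(u_depthImage, uv).r;
    float fragDepth   = ndc.z * 0.5 + 0.5;
    float bias        = 0.002;         // small bias against self-occlusion artifacts
    if (fragDepth - bias > storedDepth) {
      discard;                         // occluded: do not project the video here
    }

    // Visible: paint the fragment with the projected video-frame texel.
    gl_FragColor = texture2D(u_videoFrame, uv);
  }
`;
```

In this sketch, a discarded fragment simply leaves the previously rendered model surface untouched, so the video is painted only where the depth camera actually "sees" the geometry.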

 

 

Point 2: Have you considered different types of buildings for comparison, from simple buildings to complex buildings in shape?

 

Response 2: As the projection principle described in this paper shows, the complexity of the 3D model has little influence on the projection results. We have experimented with projecting onto buildings of different shapes using test data, and the projection results are quite good. The experimental results can be seen in the following figures; they are not included in the manuscript because the test data have no geographical connection to the study area.

[Figures: projection results for buildings of different complexity; see the Author Response attachment.]

Point 3: Is it also possible to capture moving objects?

 

Response 3: Yes. Objects in the video images can be captured with target-recognition technology; the timestamp of the corresponding frame is returned, matched against the records in the location table, and the object is labeled in 3DGIS. By recognizing targets in the key frames frame by frame and labeling them in 3DGIS, moving objects can be captured and tracked. The application prospect of this research for moving-object capture has been added to the discussion in the new Chapter 4.

 

‘Although the video is not processed in real time and the experiment does not achieve high-precision dynamic fusion, the research still has broad application prospects. In the law-enforcement supervision of land resources, videos can be taken with mobile devices and transferred in batches into 3DGIS for land identification, management of illegal buildings, monitoring of green areas, etc. In urban security and police enforcement, dynamic fusion management of PTZ cameras can be performed for security management. Dynamic targets in law-enforcement recorders can be recognized, objects in video frames captured and located in 3DGIS, and target trajectories retrospectively restored. For ubiquitous data, short-video platforms such as TikTok can be used to track and evaluate captured emergencies, and vehicle camera video can be used to monitor the health of roads and city components, etc.’
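As a hypothetical sketch of the moving-object idea in Response 3 (not the authors' implementation), the TypeScript below matches a detection timestamp against the recorded location table by linear interpolation and produces a label record at the camera's recorded position at that moment, which could then be placed in the 3DGIS scene. The record layout and helper names are assumptions for illustration.

```typescript
// Hypothetical sketch: turn a detection (object class + frame timestamp) into a
// georeferenced label by interpolating the recorded smartphone location table.
interface LocationRecord { t: number; lon: number; lat: number; height: number; }
interface GeoLabel { text: string; lon: number; lat: number; height: number; }

function lerp(a: number, b: number, f: number): number {
  return a + (b - a) * f;
}

// records must be sorted by timestamp t (milliseconds).
function labelDetection(records: LocationRecord[], detectedClass: string, t: number): GeoLabel | null {
  if (records.length === 0 || t < records[0].t || t > records[records.length - 1].t) {
    return null; // detection falls outside the recorded track
  }
  // Find the bracketing pair of records and interpolate between them.
  for (let i = 0; i < records.length - 1; i++) {
    const a = records[i];
    const b = records[i + 1];
    if (t >= a.t && t <= b.t) {
      const f = b.t === a.t ? 0 : (t - a.t) / (b.t - a.t);
      return {
        text: `${detectedClass} @ ${new Date(t).toISOString()}`,
        lon: lerp(a.lon, b.lon, f),
        lat: lerp(a.lat, b.lat, f),
        height: lerp(a.height, b.height, f),
      };
    }
  }
  return null;
}

// The returned label could then be added to the 3DGIS scene, e.g. as a labeled
// entity positioned at the interpolated longitude/latitude/height.
```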

Author Response File: Author Response.docx

Reviewer 2 Report

The paper presents a method for fusing 3D GIS and mobile video. The purpose of the fusion is to enable the video content to be used for dynamic texturing. The method is analyzed quantitatively (accuracy and time efficiency) by dividing the main idea into several different categories. The result of this study is a ranking of the different setups, in which one specific configuration is found to be more suitable than the others when the system is required to meet the demands of human eyesight.

I find the paper very well written and structured. The scientific level is high and the subject could not be more timely. This area of research depends strongly on how smartphone camera technology develops, and right now phone cameras are very capable and are defining the mobile video market.

The methods used are relevant and well explained. The illustrations are adequate and communicate the structure clearly. The algorithms are presented with good symbology and very clear mathematical notation.

The results are explained through a case in which the 3D building model is textured with the mobile video, using pictures and graphical annotations on them. The supporting illustrations strengthen the message and fulfill the intended communication.

In some sections the text and figures could use more space between each other. The English language is used well, with only minor spelling and grammatical errors.

I recommend that this paper be accepted for publication with minor revisions.

Author Response

We are very grateful for your time in reading our manuscript and express our sincere appreciation for your recognition of our work. Your constructive comments and useful suggestions have been considered in depth and detail, and changes have been made accordingly. We hope that the resubmitted manuscript meets your expectations.

(Photos cannot be viewed in this window, please see the attachment for details.)

Point 1: In some sections the text and figures could use more space between each other.

 

Response 1: We have adjusted the spacing between the text and the figures to make the manuscript more comfortable to read.

[Screenshots of the adjusted layout; see the Author Response attachment.]

Point 2: The English language is used well, with only minor spelling and grammatical errors.

 

Response 2: We rechecked the entire manuscript and made appropriate adjustments and changes to some words and phrases. Examples are shown in the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

The fusion of mobile video and 3D GIS is an important topic in Video GIS. This paper integrates and compares various algorithms to realize the dynamic fusion of smartphone video and 3DGIS. The results seem promising for expanding the video sources that can be fused with a virtual scene and for practical applications. However, the following issues should be considered:

 

1. Well-known, mature technologies such as the model loading process of Cesium, triple interpolation, the camera view frustum, and the WebGL rendering pipeline should be simplified, and the authors’ OWN method should be elaborated. The projection texture mapping method (Figure 5) seems similar to the depth test or shadow map pipeline in a 3D engine.

 

2. The lens distortion is not discussed when projecting the video texture on the 3D scene. What impact will it have on the fusion results, and how to deal with it?

 

3. Is the experiment conducted in real-time or offline? If the smartphone video is transmitted to the 3D system in real-time, the transmission delay should be taken into consideration. If not, however, its application prospect will be limited.

 

4. As the mobile video and 3D GIS are fused dynamically, the experimental results especially the fusion scenes (screenshots) are too few. It would be better if the authors share some demo videos or screen recordings and provide the link in the Data Availability Statement.

Author Response

We are very grateful for your time in reading our manuscript and express our sincere appreciation for your recognition of our work. Your constructive comments and useful suggestions have been considered in depth and detail, and changes have been made accordingly. We hope that the resubmitted manuscript meets your expectations.

(Photos cannot be viewed in this window, please see the attachment for details.)

Point 1: Well-known, mature technologies such as the model loading process of Cesium, triple interpolation, the camera view frustum, and the WebGL rendering pipeline should be simplified, and the authors’ OWN method should be elaborated. The projection texture mapping method (Figure 5) seems similar to the depth test or shadow map pipeline in a 3D engine.

 

Response 1: We have streamlined the sections on the model loading process, triple interpolation, the camera view frustum and the WebGL rendering pipeline. The dynamic fusion method in Chapter 2 has been described in detail and a flowchart has been added. The method draws on the principle of shadow mapping: shadows are cast onto the model surface within the view frustum, a single depth test is added, and the mapped shadows are then replaced with the textures of the video images by the texture mapping method. The specific modifications are shown below:

[Revised text, figures and flowchart; see the Author Response attachment.]

Point 2: The lens distortion is not discussed when projecting the video texture on the 3D scene. What impact will it have on the fusion results, and how to deal with it?

 

Response 2: Lens distortion is a general term for the distortion inherent in an optical lens, that is, the distortion caused by its perspective projection. It distorts the video image and its colors, which affects the accuracy of the video projection. The projection error caused by distortion can be reduced by using Zhang’s ‘Flexible New Technique for Camera Calibration’ to estimate the radial distortion coefficients, which have the greatest influence on the distortion. However, since the sensor error and pose-solving error are relatively large and have a greater impact on the projection effect, the impact of lens distortion should be discussed only after those errors have been optimized. An additional explanation has been added to the discussion in Chapter 4; the relevant paragraph is as follows:

 

‘Notably, lens distortion is not discussed in this paper. This is because the sensor error and attitude-resolution error have a greater impact on the accuracy of the fusion effect, while the timing-correspondence error and lens distortion have a smaller negative impact. The impact of timing-correspondence error and lens distortion on the fusion effect will not be considered until the larger errors have been reasonably optimized. Therefore, follow-up work on accuracy should focus on optimizing the cell phone sensor error and attitude-resolution error, while in terms of application the focus should be on fusing real-time video captured by cell phones with 3DGIS.’
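As background to the mention of Zhang's calibration method in Response 2, the sketch below shows the common radial distortion model whose coefficients (k1, k2) such a calibration estimates, together with a simple fixed-point inversion that approximately undistorts an observed point. Function and parameter names are illustrative and are not taken from the paper.

```typescript
// Illustrative sketch of the radial distortion model whose coefficients (k1, k2)
// Zhang's "Flexible New Technique for Camera Calibration" estimates.
// Coordinates are normalized image coordinates, i.e. (pixel - principal point) / focal length.
interface Point { x: number; y: number; }

// Forward model: ideal (undistorted) point -> observed (distorted) point.
function distort(p: Point, k1: number, k2: number): Point {
  const r2 = p.x * p.x + p.y * p.y;
  const s = 1 + k1 * r2 + k2 * r2 * r2;
  return { x: p.x * s, y: p.y * s };
}

// Inverse model: observed point -> ideal point, by simple fixed-point iteration.
// This is the correction that would reduce the projection error caused by distortion.
function undistort(observed: Point, k1: number, k2: number, iterations = 5): Point {
  let p: Point = { ...observed };
  for (let i = 0; i < iterations; i++) {
    const r2 = p.x * p.x + p.y * p.y;
    const s = 1 + k1 * r2 + k2 * r2 * r2;
    p = { x: observed.x / s, y: observed.y / s };
  }
  return p;
}

// Example: a point near the image corner with mild barrel distortion (k1 < 0).
const corrected = undistort({ x: 0.8, y: 0.6 }, -0.12, 0.03);
console.log(corrected); // slightly displaced outward relative to the observed point
```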

 

 

Point 3: Is the experiment conducted in real-time or offline? If the smartphone video is transmitted to the 3D system in real-time, the transmission delay should be taken into consideration. If not, however, its application prospect will be limited.

 

Response 3: The experiments were performed offline; the phone was responsible only for collecting the data. A clarification has been added to the manuscript. It is true that this imposes some limitation on the application prospects, but a number of prospects remain. The application prospects of the existing research results are discussed in the supplementary Chapter 4; the relevant additions are as follows.

‘Although the video is not processed in real time and the experiment does not achieve high-precision dynamic fusion, the research still has broad application prospects. In the law-enforcement supervision of land resources, videos can be taken with mobile devices and transferred in batches into 3DGIS for land identification, management of illegal buildings, monitoring of green areas, etc. In urban security and police enforcement, dynamic fusion management of PTZ cameras can be performed for security management. Dynamic targets in law-enforcement recorders can be recognized, objects in video frames captured and located in 3DGIS, and target trajectories retrospectively restored. For ubiquitous data, short-video platforms such as TikTok can be used to track and evaluate captured emergencies, and vehicle camera video can be used to monitor the health of roads and city components, etc.’

 

Point 4: As the mobile video and 3D GIS are fused dynamically, the experimental results especially the fusion scenes (screenshots) are too few. It would be better if the authors share some demo videos or screen recordings and provide the link in the Data Availability Statement.

 

Response 4: Nine new screenshots of dynamic fusion scenes have been added to the dynamic fusion experiment in Chapter 3, as shown below. Owing to project-related restrictions, we are unable to release demonstration videos publicly; we apologize for this.

 

‘The dynamic projection algorithm is applied for the projection. The experimental results are shown in Figure 11, which shows the dynamic projection fusion effect from different viewpoints and at different moments. It can be clearly seen that the position of the viewpoint and the pose of the viewing frustum change with time. It can also be seen that the projection accuracy shows some deviations caused by the various errors.’

[Figure 11: nine screenshots, panels (a)–(i), of the dynamic projection fusion at different viewpoints and moments; see the Author Response attachment.]

 

Author Response File: Author Response.docx

Reviewer 4 Report

Thank you for submitting this article, well done. In my opinion it should be published.

Author Response

We are very grateful for your time in reading our manuscript and express our sincere appreciation for your recognition of our work.

Reviewer 5 Report

The paper proposes a method for texture mapping in 3D GIS based on mobile video data. The paper is concisely written and well structured. However, some additional explanations regarding the implementation of the approach are required. Is the video shot and processed in real time? Where are the position and smartphone attitude parameters recorded? How are the textures of a 3D object obtained from the video imported into Cesium?

 

The authors should emphasise the contribution of the research compared to other similar approaches.

 

Author Response

We are very grateful for your time in reading our manuscript and express our sincere appreciation for your recognition of our work. Your constructive comments and useful suggestions have been considered in depth and detail, and changes have been made accordingly. We hope that the resubmitted manuscript meets your expectations.

(Photos cannot be viewed in this window, please see the attachment for details.)

Point 1: Is the video shot and processed in real time?

 

Response 1: The experiments were performed offline; the phone was responsible only for collecting the data. A clarification has been added to the manuscript. It is true that this imposes some limitation on the application prospects, but a number of prospects remain. The application prospects of the existing research results are discussed in the supplementary Chapter 4; the relevant additions are as follows.

‘Although the video is not processed in real time and the experiment does not achieve high-precision dynamic fusion, the research still has broad application prospects. In the law-enforcement supervision of land resources, videos can be taken with mobile devices and transferred in batches into 3DGIS for land identification, management of illegal buildings, monitoring of green areas, etc. In urban security and police enforcement, dynamic fusion management of PTZ cameras can be performed for security management. Dynamic targets in law-enforcement recorders can be recognized, objects in video frames captured and located in 3DGIS, and target trajectories retrospectively restored. For ubiquitous data, short-video platforms such as TikTok can be used to track and evaluate captured emergencies, and vehicle camera video can be used to monitor the health of roads and city components, etc.’

 

Point 2: Where are the position and smartphone attitude parameters recorded?

 

Response 2: The position and smartphone pose parameters are recorded in a table in the database. When fusion is required, the table is imported into the browser, interpolated, and stored as a Parameter class. Dynamic correspondence fusion is then achieved by cyclically traversing these parameters together with the video frame images. A description has been added in Section 2.2.3, together with a supplementary flowchart.
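A minimal sketch of the resampling step described here (and in the quoted passage below): the pose/position records from the database table are interpolated to each video frame's timestamp before the viewing frustum is rebuilt. The record fields and function names are assumptions for illustration; the paper uses triple interpolation, whereas this sketch uses simple linear interpolation for brevity.

```typescript
// Hypothetical sketch: resample the recorded position/pose table to video-frame timestamps.
interface PoseRecord {
  t: number;                                     // timestamp (ms)
  lon: number; lat: number; height: number;      // GPS position
  heading: number; pitch: number; roll: number;  // solved attitude (degrees)
}

function lerp(a: number, b: number, f: number): number {
  return a + (b - a) * f;
}

// Interpolate the shortest way around for angles (avoids a 359° -> 1° jump of 358°).
function lerpAngle(a: number, b: number, f: number): number {
  const d = ((b - a + 540) % 360) - 180;
  return a + d * f;
}

// records sorted by t; frameTimes are the timestamps of the decoded video frames.
function resamplePoses(records: PoseRecord[], frameTimes: number[]): PoseRecord[] {
  const out: PoseRecord[] = [];
  if (records.length === 0) return out;
  let i = 0;
  for (const t of frameTimes) {
    while (i < records.length - 2 && records[i + 1].t < t) i++;
    const a = records[i];
    const b = records[Math.min(i + 1, records.length - 1)];
    const f = b.t === a.t ? 0 : Math.min(Math.max((t - a.t) / (b.t - a.t), 0), 1);
    out.push({
      t,
      lon: lerp(a.lon, b.lon, f),
      lat: lerp(a.lat, b.lat, f),
      height: lerp(a.height, b.height, f),
      heading: lerpAngle(a.heading, b.heading, f),
      pitch: lerpAngle(a.pitch, b.pitch, f),
      roll: lerpAngle(a.roll, b.roll, f),
    });
  }
  return out;
}

// Each resampled record then drives one rebuild of the depth camera's viewing frustum
// (depth image + transform matrix) before the corresponding video frame is mapped.
```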

 

‘For the dynamic projection of the video, the 3D model is first loaded in Cesium. Then the video is imported and parsed, and the image texture is stored frame by frame in the Texture class to be passed to the shader for texture mapping. Finally, the timestamps, the solved pose data and the sensor data such as GPS are read into the browser as a table for spatio-temporal interpolation. After interpolation, the shadow-casting viewing frustum is constructed from the position and pose and the related internal and external parameters. The depth image and the transform-clipping matrix within the viewing frustum are obtained and passed to the shader for texture mapping of a single image. After one mapping is completed, the position and pose parameters of the next frame are passed in, the depth image and transform-clipping matrix of the viewing frustum are updated, and texture mapping is performed for the next frame, achieving dynamic projection fusion of the moving video, as shown in Figure 7.’

[Figure 7: flowchart of the dynamic projection process; see the Author Response attachment.]

Point 3: How are the textures of a 3D object obtained from the video imported into Cesium?

 

Response 3: The video data are first imported into the browser, and the video frames are parsed and stored in the Texture class, which is passed to the shader as a uniform variable together with the related parameters. The texture mapping method computes the coordinates of each fragment in the video image and extracts the corresponding texel to assign to that fragment. In this way, the texture information of the video image is transferred to the surface fragments of the Cesium 3D model. The description has been added in Section 2.2.3 and the flowchart has been supplemented; details of the modifications are given in Point 2.
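As a general illustration of how a video frame ends up as a GPU texture that a shader can sample (Cesium wraps this mechanism in its own internal Texture class; the function names below are illustrative, not Cesium API), the following WebGL sketch uploads the current frame of an HTML video element to a texture:

```typescript
// Generic WebGL sketch: upload the current frame of a <video> element as a texture.
function createVideoTexture(gl: WebGLRenderingContext): WebGLTexture {
  const texture = gl.createTexture();
  if (!texture) throw new Error('Failed to create texture');
  gl.bindTexture(gl.TEXTURE_2D, texture);
  // Video dimensions are usually not powers of two, so clamp and use linear filtering.
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.LINEAR);
  return texture;
}

// Call once per rendered frame: re-upload the pixels of the video's current frame.
function updateVideoTexture(gl: WebGLRenderingContext, texture: WebGLTexture, video: HTMLVideoElement): void {
  if (video.readyState < 2) return; // no decoded frame available yet
  gl.bindTexture(gl.TEXTURE_2D, texture);
  // texImage2D accepts an HTMLVideoElement directly and grabs its current frame.
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
}

// The texture is then bound to a sampler2D uniform (e.g. the u_videoFrame uniform in the
// shader sketch above), and the projective texture mapping assigns each fragment its texel.
```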

 

Point 4: The authors should emphasise the contribution of the research compared to other similar approaches.

 

Response 4: The work in this paper mainly enriches the dynamic fusion methods available for mobile video. A discussion section (Chapter 4) has been added to the manuscript to briefly discuss the contributions and application prospects of this research, and nine images of experimental results have been added. The specific contributions and discussion are as follows (an illustrative sketch of the attitude solution mentioned in contribution (2) follows the list):

 

‘This paper proposed a dynamic fusion algorithm for mobile video with multi-system integration. Compared with previous studies, the contributions are as follows:

(1) A complete set of theoretical methods for the dynamic fusion of mobile video and 3D GIS is proposed.

(2) By treating the cell phone like an aircraft, an Attitude and Heading Reference System (AHRS) is introduced to solve the cell phone attitude, and the accuracy and efficiency of the solutions are compared.

(3) The proposed texture mapping method, which dynamically builds a depth camera, meets the requirements of comfortable viewing by the human eye during dynamic fusion and expands the video sources that can be fused with 3DGIS.’
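For contribution (2), the sketch below shows a highly simplified complementary-filter attitude update of the kind an AHRS performs, fusing gyroscope integration with an accelerometer tilt reference. It illustrates only the principle under an assumed axis convention; it is not the attitude-resolution algorithm evaluated in the paper, and all names are hypothetical.

```typescript
// Highly simplified AHRS-style complementary filter: fuse gyroscope integration with
// an accelerometer tilt reference for pitch and roll. Illustrative only.
// Assumed device frame: z roughly opposite to gravity when the phone lies flat.
interface Attitude { pitch: number; roll: number; } // radians

function updateAttitude(
  prev: Attitude,
  gyro: { x: number; y: number; z: number },   // angular rate, rad/s (x: roll axis, y: pitch axis)
  accel: { x: number; y: number; z: number },  // specific force, m/s^2
  dt: number,                                  // time step, s
  alpha = 0.98                                 // weight of the gyro path
): Attitude {
  // 1) Propagate the previous attitude with the gyroscope rates.
  const pitchGyro = prev.pitch + gyro.y * dt;
  const rollGyro  = prev.roll  + gyro.x * dt;

  // 2) Tilt reference from the accelerometer (valid when acceleration is close to gravity).
  const pitchAcc = Math.atan2(-accel.x, Math.sqrt(accel.y * accel.y + accel.z * accel.z));
  const rollAcc  = Math.atan2(accel.y, accel.z);

  // 3) Complementary blend: trust the gyro at high frequency, the accelerometer at low frequency.
  return {
    pitch: alpha * pitchGyro + (1 - alpha) * pitchAcc,
    roll:  alpha * rollGyro  + (1 - alpha) * rollAcc,
  };
}

// A full AHRS additionally uses the magnetometer for heading and typically works with
// quaternions (e.g. Mahony or Madgwick filters) rather than separate Euler angles.
```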

 

‘Although the video is not processed in real time and the experiment does not achieve high-precision dynamic fusion, the research still has broad application prospects. In the law-enforcement supervision of land resources, videos can be taken with mobile devices and transferred in batches into 3DGIS for land identification, management of illegal buildings, monitoring of green areas, etc. In urban security and police enforcement, dynamic fusion management of PTZ cameras can be performed for security management. Dynamic targets in law-enforcement recorders can be recognized, objects in video frames captured and located in 3DGIS, and target trajectories retrospectively restored. For ubiquitous data, short-video platforms such as TikTok can be used to track and evaluate captured emergencies, and vehicle camera video can be used to monitor the health of roads and city components, etc.’

 

‘The dynamic projection algorithm is applied for the projection. The experimental results are shown in Figure 11, which shows the dynamic projection fusion effect from different viewpoints and at different moments. It can be clearly seen that the position of the viewpoint and the pose of the viewing frustum change with time. It can also be seen that the projection accuracy shows some deviations caused by the various errors.’

[Figure 11: nine screenshots, panels (a)–(i), of the dynamic projection fusion at different viewpoints and moments; see the Author Response attachment.]

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

no comments

Author Response

We are very grateful for your time in reading our manuscript and express our sincere appreciation for your recognition of our work.

Reviewer 5 Report

The authors have appropriately answered to the remarks from the previous round of review.

Author Response

We are very grateful for your time in reading our manuscript and express our sincere appreciation for your recognition of our work.
