Article

Learning Point Processes and Convolutional Neural Networks for Object Detection in Satellite Images

1 Inria, Université Côte d'Azur, 06902 Sophia Antipolis, France
2 Airbus Defence and Space, 31400 Toulouse, France
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(6), 1019; https://doi.org/10.3390/rs16061019
Submission received: 31 January 2024 / Revised: 1 March 2024 / Accepted: 6 March 2024 / Published: 13 March 2024
(This article belongs to the Special Issue Pattern Recognition and Image Processing for Remote Sensing III)

Abstract

Convolutional neural networks (CNNs) have shown great results for object-detection tasks by learning texture and pattern-extraction filters. However, object-level interactions are harder to grasp without increasing the complexity of the architectures. On the other hand, Point Process models propose to solve the detection of the configuration of objects as a whole, allowing both the image data and prior knowledge of object interactions to be factored in. In this paper, we propose combining the information extracted by a CNN with priors on objects within a Markov Marked Point Process framework. We also propose a method to learn the parameters of this Energy-Based Model. We apply this model to the detection of small vehicles in optical satellite imagery, where the image information needs to be complemented with object-interaction priors because of noise and the small size of the objects.

Graphical Abstract

1. Introduction

While object detection has been thoroughly studied for the last 20 years [1], the detection of small objects in optical satellite images remains a challenging task: with limited spatial resolution (around 0.5 m/pixel), objects such as cars can be only a few pixels wide. Moreover, the density of these small objects varies greatly across scenes, and when they are closely packed, individual instances can be difficult to differentiate. However, as the geometrical configuration of one object often depends on its neighbors, we can leverage such priors to improve the detection results.
Methods based on convolutional neural networks (CNN) such as Faster R-CNN [2], YOLO [3], or RetinaNet [4] propose to detect objects in “natural” images. Some of these approaches first extract features through convolutions, then propose a series of boxes (anchors) that are refined by regression afterward [5,6]. The specificities of remote-sensing data (high number of small objects) motivate the use of anchor-free methods, relying on heatmap (probability of object center) inference for oriented vehicle detection [7,8,9], or ship detection [10] (here approximating the heatmap through a vector field).
In remote sensing, the sensor is intrinsically quite distant from the observed objects, implying a limited spatial resolution (usually measured in meters per pixel). This limited resolution, in turn, limits the amount of visual information available to perform object detection. Given sensor noise and atmospheric perturbations, remote-sensing detection methods need to be innovative to compensate for the limited signal. Approaches such as [11,12] use the temporal information from image time series to improve the detection of small objects. However, these are not suitable for detecting small static objects. To extract object orientation, Ref. [13] focuses on extracting fine-grained features while ignoring any inter-instance interactions. The interactions between nearby objects are key for human operators analyzing a scene. However, they are often ignored in CNN-based approaches (or only considered in a post-processing step such as non-maximum suppression), since CNN models are, by construction, most efficient at extracting texture and local information [14]. Meanwhile, long-range relationships are only captured when drastically increasing the depth of the network or introducing dense connections. For instance, Ref. [15] models interactions through a cascade of Transformers at a great complexity cost, while [16] uses text-modal descriptors to introduce prior knowledge about the objects and their relations into the model.
Meanwhile, Point Processes [17] have been used to perform object detection while relying on a stochastic geometry model to jointly solve the detection and selection of objects with priors. Point Process approaches model a probability over the whole set of points in the image, thus taking into account the interactions between objects. These approaches rely on decomposing the Point Process density into a data term, which measures the correspondence of the points against the image, and prior terms, which measure the coherence of the point configuration itself. These approaches have shown good results for detection tasks in images with many small objects, such as biological imagery [18] or remote sensing for road [19,20] and building [21] extraction, or tree detection [22].
Previous methods make use of data terms built upon image contrast measures between the interior and exterior of the object [18,20,23] or upon the image gradient [24], which perform well in images with a clear contrast between objects and their background. However, in the case of optical satellite images, background diversity, the variable visual aspect of objects, and inconsistent illumination make these contrast measures unreliable.
In this paper, we propose the incorporation of interaction models into object-detection methods while taking advantage of the capabilities of deep convolutional neural networks. Satellite images are inherently imperfect due to atmospheric disturbances, partial occlusions, and limited spatial resolution. To compensate for this lack of visual information, it becomes essential to incorporate prior knowledge about the layout of objects of interest.
Instead of increasing the complexity of the model by adding, for example, Transformers to a deep CNN architecture, we propose in this paper to combine CNN pattern extraction with the Point Process approach. The starting point of this approach is to use the output of a CNN as the data term for a Point Process model (Section 2.3). From the latter, we derive more efficient sampling methods for the Point Process that do not rely on application-specific heuristics (Section 2.4). Then, we propose to bridge the gap in terms of parameter estimation using modern learning techniques inspired by Energy-Based Models (Section 2.5). Ultimately, we introduce a novel scoring function that takes into account object interaction in measuring the confidence in the detections and enables explainable results (Section 2.6). The final part presents quantitative and qualitative results on optical satellite data (Section 3).

2. Materials and Methods

In this paper, we propose an object-detection method leveraging CNN-extracted information within a Point Process framework that models object interactions. First, we introduce the foundations for the Point Process on which our model is built, then detail the energy model, along with its sampling and parameter estimation method. We also propose a novel scoring method to evaluate the confidence in the output while considering the interactions and providing explainable results.

2.1. Materials: Datasets

Our application goal with Airbus Defense and Space (ADS) is the detection of small objects in images from satellites such as Pléiades (Constellation Pléiades, https://pleiades.cnes.fr/en/PLEIADES/GP_systeme.htm, accessed on 20 October 2023) or CO3D (Constellation Optique 3D, https://www.eoportal.org/satellite-missions/co3d-constellation, accessed on 20 October 2023), which have a typical spatial resolution (or ground sampling distance) of 0.5 m per pixel.
To train the CNN and infer the model parameters θ , we compile DOTA 0.5 , a dataset of vehicles in remote-sensing images with 0.5 m per pixel resolution, from the DOTA dataset [25]. This dataset DOTA 0.5 is built by sub-sampling images from DOTA to the desired spatial resolution and keeping only small-vehicle and large-vehicle classes. The data consists of images from satellite or aerial sources of urban or countryside scenes with variable densities of objects (from isolated cars on country roads to dense parking lots). The images are labeled with sets of oriented rectangles for the objects of interest. We also compile a noisy version of DOTA 0.5 : DOTA 0.5 +noise, by adding a Gaussian noise on each image ( σ = 0.3 ) to test the resilience of the methods against noise. Moreover, we evaluate models on data provided by ADS. These aerial data are sub-sampled to the desired resolution, emulating the above-mentioned satellite characteristics. This dataset is unlabeled; thus, it will only serve to evaluate the models qualitatively and was not used for training.

2.2. Point Process for Object Detection

Point Processes model configurations of points (a finite unordered set $\mathbf{y} = \{y_1, \dots, y_n\}$ of elements of $S$) as realizations of a random variable $\Phi$ in the set of all possible configurations $\mathcal{Y} = \bigcup_{n=0}^{\infty} (S \times M)^n$ (with an arbitrary number of points). Space $S$ corresponds to the image space, and $M$ to the mark space. The marks can be any random variable that describes the point beyond its location, from the radius of a circle to a discrete categorization of the object. Additionally, we denote $n(\mathbf{y})$ the number of points in the configuration $\mathbf{y}$ and $\mathcal{Y}_n$ the set of all $n$-point configurations.
As we look to detect vehicles, we model points as rectangles described by the following marks: width y a , length y b , and angle y α as shown in Figure 1 (with the mark space defined as M = [ a min , a max ] × [ b min , b max ] × [ 0 , π ] ).
A Point Process Φ is described by its density h relative to the uniform Point Process [17]. The model of selection and interaction of points derives from an energy U, through a Gibbs density:
$$h(\mathbf{y} \mid \mathbf{X}) = \frac{1}{Z} \exp\left(-U(\mathbf{y}, \mathbf{X}, \theta)\right),$$
where X is the image, θ the parameters of the model, and Z is a normalizing constant. This constant is intractable due to the non-fixed dimension of Y . However, it can be shown that Z exists (The existence and finiteness of Z are sufficient as the following computations will consider ratios of h; thus Z will cancel out.) and thus h is properly defined as the energy per point is bounded (see Appendix A.2). In short, the energy function U measures the compatibility of the configuration y with the image X ; the lower the energy, the higher the compatibility (see Energy-Based Models [26]). It follows that the most compatible output y * minimizes the energy U (i.e., maximizes the density h).

2.2.1. Markovian Point Process

For most applications [18,20,21,23], the physical constraints of the system that is being modeled imply that the marginal density of a point y y should only depend on a neighborhood around it (e.g., vehicles only align with others within a limited distance). This property is formalized through the Markovianity of the Point Process.
Definition 1.
Point Process Φ is a Markov Point Process under the symmetric and reflexive relation ∼ if and only if, for every configuration $\mathbf{y} \in \mathcal{Y}$ such that $h(\mathbf{y}) > 0$ [17]:
  • $\forall \mathbf{x} \subset \mathbf{y}$, $h(\mathbf{x}) > 0$;
  • for every point $u \in S$, $h(\{u\} \cup \mathbf{y}) / h(\mathbf{y})$ only depends on u and its neighbors according to ∼ (i.e., $\{y \in \mathbf{y},\ y \sim u\}$).
Theorem 1.
For a Point Process derived from the energy:
$$U(\mathbf{y}, \mathbf{X}, \theta) = \sum_{y \in \mathbf{y}} V\left(y, \mathcal{N}_y^{\mathbf{y}}, \mathbf{X}, \theta\right),$$
where $\mathcal{N}_y^{\mathbf{y}} = \{y' \in \mathbf{y},\ y' \sim y,\ y' \neq y\}$ is the set of neighbors of y for relation ∼, the Point Process is Markovian for relation $\sim^2$, with $y \sim^2 y' \iff \exists u \in S \times M,\ y \sim u \sim y'$.
The proof is provided in Appendix A.1.
In short, if the energy V ( y , N y y , X , θ ) of a point y depends on its neighbors within a distance of d max , the Point Process is Markovian for a distance 2 d max . The Markovianity will prove useful to simplify the sampling procedure and run it in parallel over the whole image.

2.2.2. Papangelou Conditional Intensity

The reference Poisson Point Process has an intensity λ that is either constant or depends on the location (for any compact $A \subset S$, $\mathbb{E}[n(\Phi \cap A)] = \lambda |A|$ if λ is constant). The density in (1) implies the intensity is now a function of the location and neighborhood of a point:
Definition 2.
The Papangelou conditional intensity $\lambda(\cdot\,; \cdot)$ [17] associated with a Point Process Φ is defined as:
$$\lambda(y; \mathbf{y})\, dy = P\left(n(\Phi \cap dy) = 1 \mid \Phi \cap (dy)^c = \mathbf{y} \cap (dy)^c\right),$$
i.e., the infinitesimal probability of finding a point in region dy around $y \in S$, given the configuration $\mathbf{y}$ outside dy (i.e., on $(dy)^c$). When $y \in \mathbf{y}$, the Papangelou conditional intensity can be computed from the energy function U as:
$$\lambda(y; \mathbf{y} \setminus \{y\}) = \frac{h(\mathbf{y} \mid \mathbf{X})}{h(\mathbf{y} \setminus \{y\} \mid \mathbf{X})} = \exp\left(U(\mathbf{y} \setminus \{y\}, \mathbf{X}, \theta) - U(\mathbf{y}, \mathbf{X}, \theta)\right).$$

2.3. Energy-Based Model

The energy formulation of the Point Process allows the composition of several simple point behaviors into one model. We illustrate the composition of several simple point distributions through the addition of energy terms in Figure 2: there, we show that the point distribution in Figure 2d can be obtained by simply adding the energies used to produce Figure 2b,c.
When performing object detection, the image provides information about the location and properties of objects (e.g., location-based energy as in Figure 2b), while prior knowledge of the objects of interest complements this information (e.g., a repulsion term as in Figure 2c); the composability of the energy model allows the factoring in of both pieces of information into a single model (as in the composition of Figure 2d).
More specifically, for every point $y \in \mathbf{y}$ we compute a set of energy terms $V_e(y, \mathbf{X}, \mathcal{N}_y^{\mathbf{y}})$, $e \in \xi$, and combine them, per point, as a weighted sum [20,21,23]:
$$U(\mathbf{y}, \mathbf{X}, \theta) = \sum_{y \in \mathbf{y}} V\left(y, \mathbf{X}, \mathcal{N}_y^{\mathbf{y}}, \theta\right) = \sum_{y \in \mathbf{y}} \left( w_0 + \sum_{e \in \xi} w_e V_e\left(y, \mathbf{X}, \mathcal{N}_y^{\mathbf{y}}, \theta\right) \right)$$
where $V(y, \mathbf{X}, \mathcal{N}_y^{\mathbf{y}}, \theta)$ is the energy contribution of a single object y. Energy terms can be grouped into two categories:
  • Prior terms (of the form $V_e(y, \mathcal{N}_y^{\mathbf{y}}, \theta)$) that only depend on y and its neighborhood $\mathcal{N}_y^{\mathbf{y}}$; they measure the coherence of the configuration itself considering the known properties of the objects of interest (e.g., objects do not overlap).
  • Data terms ($V_e(y, \mathbf{X}, \theta)$) that are a function of y and the image $\mathbf{X}$ (also referred to as the observation in remote sensing); these terms measure the correspondence of the points with the image.
The weights w are part of the parameters θ that will need to be set before inference on any unseen image. We discuss the estimation of parameters θ in Section 2.5.
In the rest of this section, we define the multiple energy terms V e : First, the data terms, then the prior terms. Contrary to previous Point Process (PP) approaches, we do not rely on contrast measures but rather on data terms pulled from the output of a CNN: this allows for more reliable measures and faster computation. Moreover, this new model formulation opens up new improvements in sampling the energy model and estimating its parameters.
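To make the composition concrete, the following is a minimal sketch (not the authors' released implementation; the function names, argument layout, and tensor shapes are illustrative assumptions) of how the weighted per-point sum of (5) could be assembled in PyTorch:

```python
import torch

def total_energy(points, image_maps, neighbors, terms, weights, w0=0.0):
    """Weighted sum of per-point energy terms, in the spirit of Eq. (5).

    points:     (N, 5) tensor of (i, j, a, b, alpha) describing the configuration y
    image_maps: dict of pre-computed data-term maps (e.g. V_pos, V_kappa)
    neighbors:  list where neighbors[k] holds the indices of N_{y_k}^y
    terms:      dict name -> callable(points, image_maps, neighbors) -> (N,) energies V_e
    weights:    dict name -> scalar weight w_e (part of theta)
    w0:         scalar per-point offset
    """
    if points.shape[0] == 0:
        return torch.tensor(0.0)
    per_point = torch.full((points.shape[0],), w0)
    for name, term in terms.items():
        per_point = per_point + weights[name] * term(points, image_maps, neighbors)
    return per_point.sum()
```

Because each term is a separate callable, adding or removing a prior amounts to editing the `terms` and `weights` dictionaries, which mirrors the composability argued above.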

2.3.1. Data Terms from a CNN

Classical likelihood measurements [18,20,23] are based on contrast measures that are crafted to fit each application. Moreover, these measures rely heavily on the high contrast between objects and their background, along with limited background diversity. We illustrate the limitations of those in Section 3.1.
On the other hand, CNN models have shown to be very efficient at extracting features from images for object detection and classification. In the following section, we will show how to interpret a CNN-based object detector output to obtain a scalar energy that measures the fitness of a configuration against an image. It allows us to go past the contrast measure design by utilizing a pre-trained CNN output.

Position Likelihood Term

CNN-based object-detection methods such as [7,8,27] make use of a heatmap to find object centers. This object-center probability map is obtained by passing the image $\mathbf{X}$ of shape $(H, W, 3)$ (here with 3 color channels) through a fully convolutional model such as Unet [28]. The model outputs a tensor $\hat{Z}_{\text{pos}} \in \mathbb{R}^{H \times W}$, from which a probability-like measure of an object center at coordinates $(y_i, y_j)$ is obtained as $p(y_i, y_j \mid \mathbf{X}) = \sigma(\hat{Z}_{\text{pos}}[y_i, y_j])$, where $\sigma(\cdot)$ is the sigmoid function and $\hat{Z}_{\text{pos}}[y_i, y_j]$ the value of map $\hat{Z}_{\text{pos}}$ at coordinates $(y_i, y_j)$.
We propose to reinterpret this output as the position energy as follows:
$$V_{\text{pos}}(y, \mathbf{X}, \theta) = \ln\left(1 + \exp\left(-\hat{Z}_{\text{pos}}[y_i, y_j] + t_{\text{pos}}\right)\right),$$
with $t_{\text{pos}}$ a scalar threshold that shifts the inflection point of the Softplus function $x \mapsto \ln(1 + \exp(x))$.
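As an illustration, a short PyTorch sketch of turning a CNN logit heatmap into the position energy map of (6); the sign convention (low energy where the CNN predicts an object center) and the helper name are assumptions standing in for the trained network:

```python
import torch
import torch.nn.functional as F

def position_energy_map(z_pos_logits: torch.Tensor, t_pos: float = 0.0) -> torch.Tensor:
    """Position energy map V_pos from the CNN logit map Z_pos, Eq. (6).

    z_pos_logits: (H, W) tensor of raw heatmap logits.
    Returns an (H, W) map: low energy where the logit (object evidence) is high.
    """
    # softplus(-z + t_pos) = -log sigmoid(z - t_pos): decreases as the logit grows
    return F.softplus(-z_pos_logits + t_pos)

# usage: energy of a point y at integer pixel coordinates (y_i, y_j)
# v_pos_map = position_energy_map(z_hat, t_pos=0.0); v_pos = v_pos_map[y_i, y_j]
```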

Mark Likelihood Term

To compute the energy associated with marks, we discretize each mark κ ($\kappa \in \{a, b, \alpha\}$ in the case of a Point Process (PP) of rectangles) into $n_\kappa$ classes (or value bins) over the range $[\kappa_{\min}, \kappa_{\max}]$. The integer class $c_\kappa(v)$ of a value v for mark κ is defined as:
$$c_\kappa(v) = \left\lfloor n_\kappa\, \frac{v - \kappa_{\min}}{\kappa_{\max} - \kappa_{\min}} \right\rfloor, \quad \forall v \in [\kappa_{\min}, \kappa_{\max}].$$
Now, suppose we have a model trained to classify the mark of an object at a given position $(i, j) \in S$; such a model produces a probability estimate of mark κ being in class $c \in \{1, \dots, n_\kappa\}$, from the model output $\hat{Z}_\kappa^{y_i, y_j} \in \mathbb{R}^{n_\kappa}$, as:
$$\hat{p}(c \mid i, j, \mathbf{X}) = \mathrm{Softmax}\left(\hat{Z}_\kappa^{y_i, y_j}\right)_c = \frac{\exp\left(\hat{Z}_\kappa^{y_i, y_j}[c]\right)}{\sum_{c'=1}^{n_\kappa} \exp\left(\hat{Z}_\kappa^{y_i, y_j}[c']\right)}.$$
As in [29], we can reinterpret the model output into a potential, giving:
$$V_\kappa(y, \mathbf{X}) = -\hat{Z}_\kappa^{y_i, y_j}\left[c_\kappa(y_\kappa)\right] + \ln \sum_{c=1}^{n_\kappa} \exp\left(\hat{Z}_\kappa^{y_i, y_j}[c]\right).$$
As the CNN outputs a tensor Z ^ κ of shape ( H , W , n κ ) , we obtain vector Z ^ κ y i , y j by sampling tensor Z ^ κ at location ( y i , y j ) , as illustrated in Figure 3a. Vector Z ^ κ y i , y j gives an estimate of the probability of each class c once passed through a Softmax (as given by (8)); We illustrate this with the histogram, on top of Z ^ κ y i , y j in Figure 3a. The value of V κ ( y , X ) is then obtained by applying the formula in (9) over vector Z ^ κ y i , y j (as illustrated on the right-hand side of Figure 3).

Interpolation

The energies described in (6) and (9) are defined over integer coordinates in S and integer classes in M. However, points y in the configuration $\mathbf{y}$ have real-valued locations $(y_i, y_j) \in \mathbb{R}^2$, as do the marks in M. Thus, we propose interpolating values between exact pixel locations. For (9) we now have:
$$V_\kappa(y, \mathbf{X}) = V_\kappa\left[y_i, y_j, c_\kappa(y_\kappa)\right],$$
$$V_\kappa[i, j, c] = -\hat{Z}_\kappa[i, j, c] + \ln \sum_{c'=1}^{n_\kappa} \exp\left(\hat{Z}_\kappa[i, j, c']\right), \quad \forall (i, j) \in S_d,$$
where $S_d$ denotes the discretized (pixel) image space.
The above has two benefits: first, (10) defines the energy for continuous values of the marks using bilinear interpolation, thus ensuring Lipschitz continuity of the energy. Second, we can pre-compute (11) once for a given image, making the mark energy computation a simple value lookup and interpolation for each point. In practice, we proceed as illustrated in Figure 3b, where a function is applied to the CNN output Z ^ κ to obtain V κ , which is stored in memory. The interpolation gives V κ for any continuous position and mark. This pre-computation is quite fast, as it is defined as an operation over a tensor, which is implicitly parallelized by tensor computation libraries such as PyTorch [30]. The same holds for the position term (6).
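For illustration, a minimal sketch of the pre-computation in (11) and the lookup in (10), assuming the mark logits are stored as an (H, W, n_κ) tensor; the interpolation over the mark class is omitted (the nearest class is used) to keep the sketch short, and the bilinear interpolation over position is delegated to `grid_sample`:

```python
import torch
import torch.nn.functional as F

def precompute_mark_energy(z_kappa: torch.Tensor) -> torch.Tensor:
    """Turn CNN mark logits Z_kappa of shape (H, W, n_kappa) into the energy map V_kappa, Eq. (11)."""
    return -z_kappa + torch.logsumexp(z_kappa, dim=-1, keepdim=True)

def mark_energy(v_kappa: torch.Tensor, y_i: float, y_j: float, c: int) -> torch.Tensor:
    """Bilinear lookup of V_kappa at continuous position (y_i, y_j) for class index c, Eq. (10)."""
    h, w, _ = v_kappa.shape
    # grid_sample expects normalized coordinates in [-1, 1], ordered (x, y) = (j, i)
    grid = torch.tensor([[[[2 * y_j / (w - 1) - 1, 2 * y_i / (h - 1) - 1]]]])
    plane = v_kappa[..., c].unsqueeze(0).unsqueeze(0)   # (1, 1, H, W)
    return F.grid_sample(plane, grid, align_corners=True).squeeze()
```

The pre-computed map is built once per image; each subsequent energy evaluation is then a cheap lookup, which is the point made above.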

2.3.2. Energy Priors

The total energy model encompasses several priors expressed as energies, which allows regularizing the configuration against the data terms.

2.3.2.1. Object Priors

These are functions of the current point y only. For $k \in \{\text{ratio}, \text{area}\}$:
$$V_k(y, \theta) = -\exp\left(-\frac{\left(f_k(y) - \mu_k\right)^2}{2\sigma_k^2}\right)$$
with $f_{\text{ratio}}(y) = y_b / y_a$ and $f_{\text{area}}(y) = y_a\, y_b$, and $\mu_k$, $\sigma_k$ being, respectively, the target value and the standard deviation. These two priors are illustrated in the first line of Figure 4.
For our objects of interest, we find two modes in the distribution of areas and ratios: cars and trucks. To ensure that we only favor these two modes, and not objects with the area of a truck and the ratio of a car, we introduce a joint area-and-ratio prior:
$$V_{\text{jntAR}}(y, \theta) = -\exp\left(-\frac{\left(y_b / y_a - \mu_{\text{ratio}}\right)^2}{2\sigma_{\text{ratio}}^2} - \frac{\left(y_a\, y_b - \mu_{\text{area}}\right)^2}{2\sigma_{\text{area}}^2}\right).$$

2.3.2.2. Interaction Priors

The following priors depend on the neighborhood of the point y. The term in (14) penalizes overlapping objects (with $t_{\text{ovrlp}}$ the overlap threshold), (15) favors aligned objects ($t_\alpha = 0$ favors parallel objects, while $t_\alpha = \pi/2$ prefers perpendicular ones), and (16) and (17) are, respectively, repulsive and attractive priors. Finally, (18) allows the adjustment of the energy of neighborless points. Some of these priors are illustrated in Figure 4.
$$V_{\text{ovrlp}}\left(y, \mathcal{N}_y^{\mathbf{y}}\right) = \max_{y' \in \mathcal{N}_y^{\mathbf{y}}} \mathrm{Relu}\left(\frac{\mathrm{area}(y \cap y')}{\min\{\mathrm{area}(y), \mathrm{area}(y')\}} - t_{\text{ovrlp}}\right)$$
$$V_{\text{align}}\left(y, \mathcal{N}_y^{\mathbf{y}}\right) = -\min_{y' \in \mathcal{N}_y^{\mathbf{y}}} \left|\cos\left(y_\alpha - y'_\alpha - t_\alpha\right)\right|$$
$$V_{\text{repls}}\left(y, \mathcal{N}_y^{\mathbf{y}}\right) = \max_{y' \in \mathcal{N}_y^{\mathbf{y}}} \mathrm{Relu}\left(1 - \frac{d(y, y')}{d_{\max}} - t_{\text{repls}}\right)$$
$$V_{\text{attrc}}\left(y, \mathcal{N}_y^{\mathbf{y}}\right) = \min_{y' \in \mathcal{N}_y^{\mathbf{y}}} \mathrm{Relu}\left(\frac{d(y, y')}{d_{\max}} - t_{\text{attrc}}\right)$$
$$V_{\text{zrNbr}}\left(y, \mathcal{N}_y^{\mathbf{y}}\right) = \mathbb{1}\left[\left|\mathcal{N}_y^{\mathbf{y}}\right| = 0\right]$$
where $\mathrm{Relu}(x) = \max(0, x)$ for $x \in \mathbb{R}$. Note that we have introduced several parameters ($\mu_{\text{ratio}}$, $\sigma_{\text{ratio}}$, $t_{\text{ovrlp}}$, ...); these will have to be set either by trial and error or through the parameter estimation method presented in Section 2.5.
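As a concrete example, a hedged sketch of how the repulsion (16) and attraction (17) priors could be evaluated for all points at once, assuming neighborhoods are defined by center distances below $d_{\max}$; the function and argument names are illustrative:

```python
import torch

def repulsion_attraction(points, d_max, t_repls, t_attrc):
    """Per-point repulsion (16) and attraction (17) energies from pairwise center distances.

    points: (N, 2) float tensor of (i, j) centers; points farther than d_max are not neighbors.
    Returns two (N,) tensors (V_repls, V_attrc); neighborless points get 0 here (handled by V_zrNbr).
    """
    n = points.shape[0]
    d = torch.cdist(points, points)                                  # (N, N) pairwise distances
    neighbor = (d <= d_max) & ~torch.eye(n, dtype=torch.bool)
    ratio = d / d_max
    # repulsion: largest Relu(1 - d/d_max - t_repls) over the neighbors of each point
    rep = torch.relu(1.0 - ratio - t_repls).masked_fill(~neighbor, float('-inf'))
    v_repls = rep.max(dim=1).values.clamp(min=0.0)
    # attraction: smallest Relu(d/d_max - t_attrc) over the neighbors of each point
    att = torch.relu(ratio - t_attrc).masked_fill(~neighbor, float('inf'))
    v_attrc = att.min(dim=1).values
    v_attrc[torch.isinf(v_attrc)] = 0.0
    return v_repls, v_attrc
```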

2.3.3. Model Pipeline

We summarize the implementation of the overall energy model in Figure 5. As is, our architecture has the following advantages:
  • We simplify the data term computation from complex contrast measures [18,20,23] into a simple sampling and interpolation of a tensor (highlighted in green in Figure 5).
  • The backbone CNN architecture remains very simple as it does not need to model any object interactions.
  • The CNN inference—which is the most complex operation within the graph in Figure 5 (FCN block)—is limited to once per image, as the energy U can be computed for any configuration y Y without having to infer V pos and V κ a second time (pre-computed maps highlighted in lilac in Figure 5).
  • New interactions or priors can be easily added to the model as energies are aggregated through a simple linear combination (see right-hand part of the pipeline in Figure 5).
  • The implementation with tensor computation libraries (here PyTorch [30]) allows for implicit parallelization on GPU (or CPU) using the tensors batch dimension.
  • The latter also enables automatic differentiation of energy U, both relative to configuration y and to parameters θ , which will be, respectively, useful in Section 2.4.3 and Section 2.5.4.
  • Lastly, as the energy maps are defined over a raster space (thus easy to normalize), those will be of use to design the birth map in Section 2.4.1.

2.4. Sampling Configurations from the Energy Model

For a given energy model U (supposing parameters θ are set), we aim at sampling configurations that follow the Gibbs density defined in (1). This will allow us to find the most compatible configuration $\mathbf{y}^*$ for any image $\mathbf{X}$, i.e.:
$$\mathbf{y}^* = \underset{\mathbf{y} \in \mathcal{Y}}{\arg\min}\; U(\mathbf{y}, \mathbf{X}, \theta).$$
As the space $\mathcal{Y}$ in which we are trying to minimize the energy is not of fixed dimension, we must resort to an adapted sampling method such as Reversible Jump Markov Chain Monte Carlo (RJMCMC) [31]. While this method ensures proper sampling of the desired law, it often requires some application-specific adaptations to improve sampling time and efficiency (e.g., [20,23]).
Here, we propose to use the already computed energy maps V pos and V κ to add new points in relevant areas and to focus on the parallel sampling of the algorithm. Moreover, the implementation of our energy model allows us to leverage automatic gradient computation to implement diffusion dynamics and explore at a fixed number of points (i.e., within Y n ) guided by the whole energy model.

2.4.1. Birth Maps for Efficient Point Proposals

At their core, both RJMCMC [31] (an adaptation of the Metropolis-Hastings algorithm [32] to variable-dimension problems) and Jump Diffusion [33] make use of Birth and Death moves to build a Markov chain $(\mathbf{y}_t)_{t=1,\dots}$ of stationary density $h(\cdot \mid \mathbf{X})^{1/T_t}$, where $T_t$ is a temperature parameter that decreases towards zero (i.e., Simulated Annealing; convergence is proven for a logarithmic temperature decrease, and here, as in the literature, we approximate it with a geometric decrease as it is faster [17]). That way, the chain converges towards the global minimum of the energy. Proposed Births (additions of points to the current $\mathbf{y}_t$) and Deaths (removals of points) are accepted or rejected given an acceptance ratio that depends on the energy change induced by the respective addition or removal of points (for more details see [31,34]).
The simplest birth move is to propose new points uniformly in $S \times M$. However, this proves highly inefficient as the density of objects in the image varies a lot. Ideally, the birth move would propose a new point u to the current configuration $\mathbf{y}$, sampled with the marginal density $p(u \mid \mathbf{y})$, as is carried out within Gibbs sampling (maximizing the chance for the point addition to be accepted). Knowing that $p(u \mid \mathbf{y}) \propto h(\mathbf{y} \cup \{u\}) / h(\mathbf{y})$, we could compute the marginal density using the energy change induced by adding the point. However, this cannot easily be computed because of the interaction terms. Thus, we propose to approximate it by only considering the data terms (i.e., bypassing the energy change on prior terms):
$$\frac{h(\mathbf{y} \cup \{u\})}{h(\mathbf{y})} \approx \exp\left(-w_{\text{pos}}\, V_{\text{pos}}(u, \mathbf{X}) - \sum_{\kappa \in \{a, b, \alpha\}} w_\kappa\, V_\kappa(u, \mathbf{X})\right).$$
From this, we define the density d used to sample a new point u:
$$d(u) = \frac{1}{Z_d} \exp\left(-w_{\text{pos}}\, V_{\text{pos}}[u] - \sum_{\kappa \in \{a, b, \alpha\}} w_\kappa\, V_\kappa[u]\right),$$
with V pos and V κ the pre-computed position and marks energy maps. Since both are defined as tensors with a finite number of elements, the normalizing constant Z d in (21) can be computed over this discretized space of pixels and mark classes.
This allows for efficient sampling of new points, without any application-specific heuristics, by taking into account a truncated version of the energy model U that can be normalized easily.
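A short illustrative sketch of drawing one new point from the birth density (21): since the density factorizes over the marks given the pixel, one can sample the pixel first and then each mark class conditionally. The tensor layouts and names below are assumptions:

```python
import torch

def sample_birth(v_pos, v_marks, w_pos, w_marks):
    """Draw one new point u = (i, j, mark classes) from the birth density d of Eq. (21).

    v_pos:   (H, W) position energy map
    v_marks: dict kappa -> (H, W, n_kappa) mark energy maps
    w_pos, w_marks: data-term weights (a subset of theta)
    """
    # per-pixel log-weight: marginalize the mark classes out of the (discretized) joint density
    log_pix = -w_pos * v_pos
    for k, v in v_marks.items():
        log_pix = log_pix + torch.logsumexp(-w_marks[k] * v, dim=-1)
    flat = torch.distributions.Categorical(logits=log_pix.flatten()).sample()
    i, j = divmod(flat.item(), v_pos.shape[1])
    # conditional class sampling for each mark at the chosen pixel
    classes = {
        k: torch.distributions.Categorical(logits=-w_marks[k] * v[i, j]).sample().item()
        for k, v in v_marks.items()
    }
    return i, j, classes
```

In practice, the sampled pixel and class indices would still be converted (e.g., jittered) to continuous coordinates and mark values; that detail is omitted here.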

2.4.2. Focused Parallel Sampling

In its canonical implementation, the RJMCMC algorithm operates sequentially over the points in the image (it may only add/remove/transform one point at a time), which increases simulation time linearly with the number of objects. With parallelization, we aim to take advantage of the spatial Markovianity of the Point Process, as Theorem 1 states that moves further than $2 d_{\max}$ apart are independent.
As in [35], we split the image into square cells of size $d_c$ that are each assigned to one set s (each set corresponding to one color in Figure 6). These sets are referred to as mic-sets in [35], for Mutually Independent Cells. The set size $d_c$ is chosen as $2 d_{\max} + 2\delta_{\max}$ ($\delta_{\max}$ is the maximum point translation distance, introduced in the next section), so that moves in cells of one set (or color) have independent acceptance ratios: those moves can be performed in any sequential order, thus in parallel, without any change in the probability of acceptance. In practice, we have $d_c = 48$ ($d_{\max} = 16$, $\delta_{\max} = 8$).
We aim at leveraging the birth map stemming from the pre-computed energy maps V pos and V κ to focus the sampling on parts of the image where the density of objects is high. Contrary to [35], we do not use a quadtree structure of cells, but rather a regular grid of d c sized cells. In order to achieve an efficient sampling of cells (i.e., avoiding spending time in cells with no evidence of objects), weights are assigned to each cell to focus the sampling on the relevant ones. The sampling procedure goes as follows:
  • pick a move type (Birth, Death, or Diffusion),
  • pick one set s with probability p ( s ) ,
  • keep each cell in set s with probability p ( c | s )
  • for every cell c in set s run move restricted to c (Birth density becomes d c ( u ) )
The process of cell selection (step 3) allows for limiting the number of cells processed at once on big images, hence limiting the computational cost and focusing the sampling on areas with high object density. We pick the following densities:
$$p(s) = d(s),$$
$$p(c \mid s) = \min\left(1,\ n_p\, \frac{d(c)}{d(s)}\right),$$
$$d_c(u) = \frac{d(u)}{d(c)},$$
denoting (by extension) $d(s)$ and $d(c)$ the density d integrated over all cells of set s and over cell c, respectively. In practice, the parameter $n_p = 1$ ensures new points are sampled over the whole image with density d; higher values of $n_p$ allow more cells to be sampled in parallel at each step, at the cost of straying away from density d.
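A minimal sketch of the cell-selection step (Eqs. (22) and (23)), assuming the density d has already been integrated per cell; names are illustrative:

```python
import torch

def select_cells(cell_density, set_of_cell, n_p=1):
    """Pick one mic-set s with probability p(s) = d(s), then keep each of its cells
    with probability p(c|s) = min(1, n_p * d(c) / d(s)).

    cell_density: (n_cells,) float tensor, density d integrated over each cell
    set_of_cell:  (n_cells,) long tensor assigning each cell to a mic-set
    Returns the indices of the selected cells.
    """
    n_sets = int(set_of_cell.max().item()) + 1
    d_set = torch.zeros(n_sets).scatter_add_(0, set_of_cell, cell_density)
    s = torch.distributions.Categorical(probs=d_set).sample()        # pick a set, Eq. (22)
    in_s = (set_of_cell == s).nonzero(as_tuple=True)[0]
    p_keep = (n_p * cell_density[in_s] / d_set[s]).clamp(max=1.0)    # Eq. (23)
    return in_s[torch.bernoulli(p_keep).bool()]
```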

2.4.3. Jump Diffusion: Leveraging Gradients

The canonical RJMCMC algorithm also uses local transform moves to explore Y at a fixed number of points, picking one point to translate, rotate, and scale at random in y t . This is wildly inefficient as it does not consider the energy function U it is trying to minimize. Instead, we propose to use the energy gradient to explore the space more efficiently. This is the idea behind stochastic diffusion (or Langevin) dynamics [36,37]. If y t and T t denote, respectively, the configuration and temperature at time t, then the configuration for the next step of the Markov chain is given by y t + 1 = y t + d y t , with step size δ :
$$d\mathbf{y}_t = -\delta\, \frac{\partial U(\mathbf{y}_t, \mathbf{X}, \theta)}{\partial \mathbf{y}_t} + d\mathbf{w}_t \sqrt{2\, T_t}, \qquad d\mathbf{w}_t \sim \mathcal{N}(0, \delta).$$
At high temperatures, the Brownian motion from d w t ensures an exploration of space. At low temperatures, the Brownian motion is negligible, and the diffusion performs as a gradient descent. This allows fine-tuning the configuration at low temperatures while considering the whole energy model (contrary to the truncated energy used for birth maps in Section 2.4.1).
The diffusion alone allows for the exploration of space Y locally around y t at a fixed dimension (fixed number of points). To explore (or jump) across dimensions, we make use of the Birth and Death moves as defined previously (see Section 2.4.1).
In practice, as parallelization requires some maximum displacement on points (see Section 2.4.2), the step d y t from y t to y t + 1 is clipped; for each point y y t updated to y , we bound the i and j components by δ max (i.e., | y i y i | δ max , and similarly for j).
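As an illustration, a hedged sketch of one clipped diffusion step of (25) using PyTorch autograd; the energy callable and the tensor layout of the points are assumptions:

```python
import torch

def diffusion_step(points, energy_fn, delta, temperature, delta_max):
    """One Langevin/diffusion update of Eq. (25), with the (i, j) displacement clipping
    described above so the move stays compatible with parallel cell sampling.

    points:    (N, 5) tensor of (i, j, a, b, alpha)
    energy_fn: callable points -> scalar energy U(y, X, theta)
    """
    pts = points.detach().requires_grad_(True)
    energy = energy_fn(pts)
    grad, = torch.autograd.grad(energy, pts)                 # dU/dy via automatic differentiation
    noise = torch.randn_like(pts) * delta ** 0.5             # dw ~ N(0, delta)
    step = -delta * grad + noise * (2.0 * temperature) ** 0.5
    step[:, :2] = step[:, :2].clamp(-delta_max, delta_max)   # bound the i, j displacement
    return (pts + step).detach()
```

At low temperature the noise term vanishes and the update reduces to a gradient-descent refinement of the configuration, as stated above.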

2.4.4. Resulting Sampling Method

The resulting sampling method is outlined in Algorithm 1. The method requires the following inputs:
  • x 0 : initial configuration;
  • X : image;
  • θ : energy model parameters;
  • T 0 : initial temperature;
  • n s : number of samples;
  • α : temperature decay rate.
Here, we denote by $C_c\, \mathbf{x}$ the restriction of configuration $\mathbf{x}$ to cells c. Note that, while U is evaluated at every iteration of the loop on t, the energy maps $V_{\text{pos}}$ and $V_\kappa$ are only computed once, as stated in Section 2.3.3.
Algorithm 1: Sampling method.
Input: $\mathbf{x}_0$, $\mathbf{X}$, $\theta$, $T_0$, $n_s$, $\alpha$
1:  for $t = 0, \dots, n_s - 1$ do
2:      Pick diffusion with probability 0.8, else jump
3:      Pick mic-set s with probability $p(s)$
4:      Keep each cell c in s with probability $p(c \mid s)$ to make $\tilde{s}$
5:      $\mathbf{x}_{t+1} \leftarrow \mathbf{x}_t$        ▷ unvisited cells keep the previous state
6:      for all $c \in \tilde{s}$ do
7:          if diffusion then
8:              $d\mathbf{w} \sim \mathcal{N}(0, \beta)$
9:              $\Delta C_c \mathbf{x}_t = -\beta\, \nabla_{C_c \mathbf{x}_t} U(\mathbf{x}_t, \mathbf{X}, \theta) + d\mathbf{w}\sqrt{2\, T_t}$
10:             $C_c \mathbf{x}_{t+1} \leftarrow C_c \mathbf{x}_t + \Delta C_c \mathbf{x}_t$
11:         else
12:             $Q_c \leftarrow Q_{c,B}$ with probability 0.5, else $Q_{c,D}$        ▷ pick birth or death
13:             $\mathbf{x}' \sim Q_c(\mathbf{x}_t \to \cdot)$        ▷ remove a point at random or add one using the birth map
14:             $r \leftarrow \dfrac{Q_c(\mathbf{x}' \to \mathbf{x}_t)}{Q_c(\mathbf{x}_t \to \mathbf{x}')} \exp\left(-\dfrac{U(\mathbf{x}', \mathbf{X}, \theta) - U(\mathbf{x}_t, \mathbf{X}, \theta)}{T_t}\right)$        ▷ compute Green ratio
15:             $C_c \mathbf{x}_{t+1} \leftarrow C_c \mathbf{x}'$ with probability $\min(1, r)$
16:         end if
17:     end for
18:     $T_{t+1} \leftarrow \alpha\, T_t$        ▷ decrease temperature
19: end for
Output: $\mathbf{x}_{n_s}$
The above implements the sampling improvements described previously, namely:
  • sampling new points using the birth map (line 13),
  • sampling in parallel over the whole image (line 6), in practice using tensor batch dimensions,
  • using the energy gradient to perform Diffusion (line 9).

2.5. Estimating Model Parameters

In this section, we estimate the parameters θ so that, for each image $\mathbf{X}$, the associated ground truth configuration $\mathbf{y}^{\text{GT}}$ matches the most compatible configuration, i.e.:
$$\mathbf{y}^{\text{GT}} = \underset{\mathbf{y} \in \mathcal{Y}}{\arg\min}\; U(\mathbf{y}, \mathbf{X}, \theta).$$
Previous PP approaches for object detection often relied on trial-and-error parameter estimation or used linear programming [23,38]. The latter method generates a set of constraints $U(\mathbf{y}') > U(\mathbf{y}^{\text{GT}})$ (these constraints can be seen as a local reformulation of the global constraint (26)), where wrong configurations $\mathbf{y}'$ are generated by applying strong perturbations to the ground truth configuration $\mathbf{y}^{\text{GT}}$. However, this only estimates the weights $w_e$ and is prone to over-constraining when the ground truth is noisy or when considering too many constraints. Moreover, this method requires designing the procedure that generates the configurations $\mathbf{y}'$, which we find has a great effect on the estimated parameters.

2.5.1. Maximum Likelihood Learning

To tackle the above limitations, we turn towards the Energy-Based Model (EBM) literature for new parameter estimation methods. The authors in [26] propose to learn EBM parameters by maximizing the likelihood of the data $\mathcal{D}$:
$$p\left(\mathbf{y}_1^{\text{GT}}, \dots, \mathbf{y}_{|\mathcal{D}|}^{\text{GT}} \mid \mathbf{X}_1, \dots, \mathbf{X}_{|\mathcal{D}|}, \theta\right) = \prod_{k=1}^{|\mathcal{D}|} p\left(\mathbf{y}_k^{\text{GT}} \mid \mathbf{X}_k, \theta\right).$$
The negative log-likelihood is then given, for parameters $\theta_n$ at step n, as:
$$\mathcal{L}_{\text{NLL}}(\theta_n, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{k=1}^{|\mathcal{D}|} \left[ U\left(\mathbf{y}_k^{\text{GT}}, \mathbf{X}_k, \theta_n\right) + \frac{1}{\beta} \log \int_{\mathbf{y} \in \mathcal{Y}} \exp\left(-\beta\, U(\mathbf{y}, \mathbf{X}_k, \theta_n)\right) d\mathbf{y} \right]$$
with β an inverse temperature parameter (from the Gibbs distribution) that does not affect the position of the minimum [26]. The gradient of $\mathcal{L}_{\text{NLL}}$ with respect to $\theta_n$ is shown to be
$$\frac{\partial \mathcal{L}_{\text{NLL}}(\theta_n, \mathbf{y}_i^{\text{GT}}, \mathbf{X}_i)}{\partial \theta_n} = \frac{\partial U\left(\mathbf{y}_i^{\text{GT}}, \mathbf{X}_i, \theta_n\right)}{\partial \theta_n} - \int_{\mathbf{y} \in \mathcal{Y}} \frac{\partial U(\mathbf{y}, \mathbf{X}_i, \theta_n)}{\partial \theta_n}\, p(\mathbf{y} \mid \mathbf{X}_i, \theta_n)\, d\mathbf{y}$$
where $p(\mathbf{y} \mid \mathbf{X}_i, \theta_n) = \exp\left(-\beta\, U(\mathbf{y}, \mathbf{X}_i, \theta_n)\right) \big/ \int_{\tilde{\mathbf{y}} \in \mathcal{Y}} \exp\left(-\beta\, U(\tilde{\mathbf{y}}, \mathbf{X}_i, \theta_n)\right) d\tilde{\mathbf{y}}$.
While the integral $\int_{\mathcal{Y}} \frac{\partial U(\mathbf{y}, \mathbf{X}_i, \theta_n)}{\partial \theta_n}\, p(\mathbf{y} \mid \mathbf{X}_i, \theta_n)\, d\mathbf{y}$ remains intractable, it can be approximated through Monte Carlo sampling, where $\tilde{\mathbf{y}}_1, \dots, \tilde{\mathbf{y}}_N$ are drawn from the law defined by $p(\cdot \mid \mathbf{X}_i, \theta_n)$, which yields:
$$\int_{\mathbf{y} \in \mathcal{Y}} \frac{\partial U(\mathbf{y}, \mathbf{X}_i, \theta_n)}{\partial \theta_n}\, p(\mathbf{y} \mid \mathbf{X}_i, \theta_n)\, d\mathbf{y} \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\partial U(\tilde{\mathbf{y}}_k, \mathbf{X}_i, \theta_n)}{\partial \theta_n}.$$

2.5.2. Contrastive Divergence

The authors in [39,40] propose to use a single sample in their Contrastive Divergence (CD) method. This method also uses only a few simulation steps of the Markov chain Monte Carlo (MCMC) sampler to generate $\mathbf{y}^-$, starting from the desired answer $\mathbf{y}^{\text{GT}}$.
The general idea is to generate contrastive samples y that follow the density derived from U ( · , X , θ n ) at step n of the optimization. Then we proceed to update θ n to θ n + 1 , by gradient descent, to minimize the energy of the valid sample y GT , while maximizing the energy of the contrastive sample y (see Figure 7). Alternatively, we can augment the data and use positive samples y + = y GT + N ( 0 , σ + ) to replace y GT as in [41]. In the case of imperfect GT, this models the uncertainty over the labels while providing some data augmentation.
The loss to minimize is then:
$$\mathcal{L}\left(\theta_n, \mathbf{y}^+, \mathbf{y}^-, \mathbf{X}\right) = U(\mathbf{y}^+, \mathbf{X}, \theta_n) - U(\mathbf{y}^-, \mathbf{X}, \theta_n) + \gamma\, R_V,$$
$$R_V = \sum_{\mathbf{y} \in \{\mathbf{y}^+, \mathbf{y}^-\}} \frac{1}{n(\mathbf{y})} \sum_{y \in \mathbf{y}} V\left(y, \mathbf{X}, \mathcal{N}_y^{\mathbf{y}}, \theta_n\right),$$
with $\gamma > 0$ the weight of the regularization term $R_V$. We introduce this regularization term to avoid an explosion of the per-point energy $V(y, \mathbf{X}, \mathcal{N}_y^{\mathbf{y}}, \theta_n)$. To ensure a sparse weighting of energies (i.e., to minimize the number of non-zero weights $w_e$), we can introduce a further regularization term as an $L_1$ norm on the vector of weights [42]:
$$R_1 = \sum_{e \in \xi} |w_e|.$$
The broad strokes of the estimation procedure for parameters θ are as follows:
  • Pick a pair ( y GT , X ) from data D .
  • Generate positive sample y + = y GT + N ( 0 , σ + ) .
  • Generate negative sample $\mathbf{y}^-$ from the density proportional to $\exp\left(-\beta^{-1} U(\cdot, \mathbf{X}, \theta_n)\right)$.
  • Compute the loss L θ n , y + , y , X (see (31)).
  • Update θ n to θ n + 1 according to the gradient L , with the Stochastic Gradient Descent (SGD) method [43], as illustrated in Figure 7.
  • Loop back to step 1 until convergence.
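A condensed sketch of one iteration of the procedure above (steps 2 to 5), with the sampler of Algorithm 1 and the regularizer abstracted behind illustrative callables; this is an assumption-laden outline, not the authors' implementation:

```python
import torch

def cd_step(theta, energy_fn, y_gt, image, sampler, sigma_plus, gamma, optimizer):
    """One Contrastive Divergence update (steps 2-5 of the procedure above).

    energy_fn(y, image, theta) -> scalar U; sampler(image, theta) -> negative sample y-.
    sigma_plus: std of the ground-truth perturbation; gamma: regularization weight.
    The per-point regularizer below is a crude stand-in for R_V of Eq. (31).
    """
    y_pos = y_gt + sigma_plus * torch.randn_like(y_gt)        # positive sample y+
    y_neg = sampler(image, theta)                             # negative sample y-
    u_pos = energy_fn(y_pos, image, theta)
    u_neg = energy_fn(y_neg, image, theta)
    reg = u_pos / max(len(y_pos), 1) + u_neg / max(len(y_neg), 1)
    loss = u_pos - u_neg + gamma * reg                        # Eq. (31)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # SGD update of theta
    return loss.item()
```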

2.5.3. Replay Buffer

As sampling contrastive samples y can be time-consuming, we use the replay buffer introduced by [41]. The replay buffer saves Markov chain results at the current optimization step to use for initialization in the next steps, thus saving computing time. This allows for reducing the long simulation time necessary to pass the burn-in period of the Markov Chain [44] and obtain samples y . We use a sample from the law derived from U ( · , X , θ n 1 ) to initialize the chain that samples the law derived from U ( · , X , θ n ) . The use of the replay buffer within one step n of the optimization loop goes as follows:
  • With probability p B set configuration x 0 as a value of the replay buffer B picked at random. With probability 1 p B (or if the buffer is empty) set configuration x 0 as a random configuration in Y .
  • Run the Markov Chain ( x t ) t = 0 , , n s to simulate model of energy U ( · , X , θ n ) .
  • Use $\mathbf{y}^- = \mathbf{x}_{n_s}$ as the negative sample for the loss computation and the update of $\theta_n$ to $\theta_{n+1}$.
  • Update the buffer: $\mathcal{B} \leftarrow \mathcal{B} \cup \{\mathbf{y}^-\}$.
This buffer allows running longer chains across epochs (i.e., obtaining higher-quality samples) while maintaining a low computational cost.
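A minimal sketch of the replay-buffer logic described above; the default reuse probability and buffer size are illustrative, not values from the paper:

```python
import random

class ReplayBuffer:
    """Stores past negative samples to initialize future Markov chains (Section 2.5.3)."""

    def __init__(self, p_buffer=0.95, max_size=1000):
        self.samples, self.p_buffer, self.max_size = [], p_buffer, max_size

    def init_chain(self, random_config_fn):
        # with probability p_B reuse a stored sample, otherwise start from a random configuration
        if self.samples and random.random() < self.p_buffer:
            return random.choice(self.samples)
        return random_config_fn()

    def push(self, y_neg):
        self.samples.append(y_neg)
        if len(self.samples) > self.max_size:
            self.samples.pop(0)
```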

2.5.4. Parameters Estimation Algorithm

The final parameter estimation algorithm is as follows:
At line 4, we select a temperature at which to run the Markov chain that samples the negative configuration $\mathbf{y}^-$ ($\beta_1 = 1$, $\beta_2 = 10^{-2}$, $\beta_3 = 10^{-4}$); higher temperatures provide sample diversity, while lower ones seek points of minimal energy within the current energy model. The scaling ς computed at line 6 rescales the energy to a unit-variance model in order to ensure consistency in sample diversity for a given temperature; it is computed over 1000 samples $\mathbf{y} \sim \mathcal{U}(\mathcal{Y})$. Finally, for each temperature β and image $\mathbf{X}$, we keep a specific buffer $\mathcal{B}_{\mathbf{X}, \beta}$. The parameter K is the number of steps the Markov chain runs for every gradient descent step. In practice, we set $K = 0.02\, d_c^2$ so that the number of iterations scales with the number of pixels in a cell.
Algorithm 2 provides a single procedure to estimate all parameters in θ, rather than only the weights $w_e$ as done previously with linear programming [23,38]. On top of avoiding over-constraining by having no hard constraints, our method also removes the need to design a procedure to sample configurations $\mathbf{y}'$, as those are sampled from the current energy model $U(\cdot, \mathbf{X}, \theta_n)$. Finally, the replay buffer allows for good-quality samples while maintaining a limited computational burden.
Algorithm 2: Contrastive Divergence parameter estimation.
1:  $\mathcal{B} \leftarrow \emptyset$, $n \leftarrow 0$
2:  while not converged do
3:      for all $(\mathbf{y}^{\text{GT}}, \mathbf{X})$ in $\mathcal{D}$ do
4:          $\beta \sim \mathcal{U}(\{\beta_1, \beta_2, \beta_3\})$        ▷ select temperature to sample negative config.
5:          $\mathbf{x}_0 \sim \mathcal{U}(\mathcal{B}_{\mathbf{X}, \beta})$ with probability $p_B$, else $\mathbf{x}_0 \sim \mathcal{U}(\mathcal{Y})$        ▷ retrieve relevant buffer
6:          $\varsigma \leftarrow \mathrm{Var}\left(U(\cdot, \mathbf{X}, \theta_n)\right)$        ▷ compute energy scale
7:          $\mathbf{y}^- \leftarrow \mathrm{Sample}(\mathbf{x}_0, \mathbf{X}, \theta_n, \varsigma\beta, K, 1)$        ▷ see Algorithm 1
8:          $\mathbf{y}^+ \leftarrow \mathbf{y}^{\text{GT}} + \mathcal{N}(0, \sigma^+)$
9:          $\Delta\theta_n \leftarrow \nabla_{\theta_n} \mathcal{L}(\theta_n, \mathbf{y}^+, \mathbf{y}^-, \mathbf{X})$
10:         Update $\theta_{n+1}$ with $\Delta\theta_n$ using SGD
11:         $\mathcal{B}_{\mathbf{X}, \beta} \leftarrow \{\mathbf{y}^-\} \cup \mathcal{B}_{\mathbf{X}, \beta}$        ▷ update the buffer
12:         $n \leftarrow n + 1$
13:     end for
14: end while

2.6. Papangelou Intensity as a Confidence Score

Classical CNN-based object-detection models (such as [8,45,46]) yield a confidence score $s(y) \in \mathbb{R}$ for each proposed object y in the image. This confidence score is often interpreted, for each detection, as proportional to the probability of the proposed element y being a true positive, $s(y) \propto p(y \mid \mathbf{X})$. Applying a score (or confidence) threshold $t_s$ gives a set of detections for which metrics such as precision and recall can be computed by matching the detections with the ground truth. This allows adapting the threshold to the needs of the application; i.e., some applications may require few false positives (high precision) while others require few missed detections (high recall). To assess the performance independently of the threshold selection, the Average Precision (AP) metric sums up the performance of a model as the area under the Precision-Recall curve.
Previous Point Process approaches [18,47,48] only compute simple metrics such as precision, recall, or F1 score for the configuration given by the sampling procedure, as no score is associated with each object detection.

2.6.1. Papangelou Intensity as Score

With our PP approach, we propose to introduce a scoring function, first to filter the detections given a confidence threshold, and second to be able to compare our method to others using the widely used AP metric. Within the PP framework, the probability of one proposed point being an object of interest depends on the rest of the inferred configuration $\hat{\mathbf{y}}$; the scoring function should reflect this: $s(y \mid \hat{\mathbf{y}} \setminus \{y\}) \propto p(y \mid \hat{\mathbf{y}} \setminus \{y\}, \mathbf{X})$. From (3), we have that the Papangelou conditional intensity is proportional to the probability of finding a point $y \in \mathbf{y}$ in a small neighborhood dy knowing the rest of the configuration $\mathbf{y} \setminus \{y\}$. We propose to use the Papangelou conditional intensity as a score:
$$s(y \mid \mathbf{y} \setminus \{y\}) = \lambda(y; \mathbf{y} \setminus \{y\}).$$

2.6.2. Pruning Sequence

However, the dependency of the score on the current configuration yields a complication when computing the AP: when applying a threshold $t_s$ to prune the configuration $\mathbf{y}$ into $\mathbf{y}' \subset \mathbf{y}$, for any $y \in \mathbf{y}'$, the score $s(y \mid \mathbf{y}' \setminus \{y\})$ may differ from $s(y \mid \mathbf{y} \setminus \{y\})$. With a score of the form $s(y)$ that only depends on y and the image $\mathbf{X}$, such as those from classical CNN models, the score of an object is unchanged after pruning.
In the PP case, we compute the scores by sequentially removing the lowest-scoring point until none is left; i.e., we build a sequence of configurations $\mathbf{y}_1 \supset \mathbf{y}_2 \supset \dots \supset \mathbf{y}_{n(\hat{\mathbf{y}})-1} \supset \mathbf{y}_{n(\hat{\mathbf{y}})}$, with $\mathbf{y}_1 = \hat{\mathbf{y}}$, and for $n = 1, \dots, |\hat{\mathbf{y}}|$:
$$\mathbf{y}_{n+1} = \mathbf{y}_n \setminus \{y_n\}, \qquad y_n = \underset{y \in \mathbf{y}_n}{\arg\min}\; \lambda(y; \mathbf{y}_n \setminus \{y\}),$$
$$s(y_n \mid \mathbf{y}_n \setminus \{y_n\}) = s(y_n \mid \mathbf{y}_{n+1}) = \lambda(y_n; \mathbf{y}_{n+1}).$$
Equation (34) provides a pruning order $y_1, \dots, y_{n(\hat{\mathbf{y}})}$ of the points in $\hat{\mathbf{y}}$. This ordering allows plotting the Precision-Recall curve: to trace it, one only requires the sequence of $(\mathrm{Recall}(t_s), \mathrm{Precision}(t_s))$ pairs, which are obtained by sequentially pruning the lowest-scoring points. Equation (35) provides a score for each point $y_n$.
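As an illustration, a small sketch of the pruning loop defined by (34) and (35), assuming a callable that evaluates the Papangelou intensity of a point against the rest of a configuration (e.g., via (4)):

```python
def pruning_scores(config, papangelou):
    """Assign a score to every point by iteratively removing the lowest-scoring one.

    config:     list of points (the inferred configuration y_hat)
    papangelou: callable (point, rest_of_config) -> Papangelou intensity lambda(point; rest)
    Returns (point, score) pairs in pruning order (least confident first).
    """
    remaining = list(config)
    ordered = []
    while remaining:
        # score of each point against the rest of the current configuration, Eq. (34)
        scores = [papangelou(y, [z for z in remaining if z is not y]) for y in remaining]
        k = min(range(len(remaining)), key=scores.__getitem__)
        # Eq. (35): the retained score is the intensity against the configuration after removal
        ordered.append((remaining.pop(k), scores[k]))
    return ordered
```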

2.6.3. Contrastive Divergence Loss and Papangelou Intensity

On the one hand, the energy model is trained by minimizing the loss function in (31) derived from the likelihood maximization of the parameters regarding the annotated data. On the other, we evaluate the performance of the inferred configuration with the scoring method in (35) sourced from the Papangelou intensity. Here, we show that while the two are derived differently, the minimization of the loss function leads to good properties on the score function.
Here, we consider a simplified loss with only the two energy terms (i.e., $\gamma \to 0$). Denoting the energy change induced by the move from configuration $\mathbf{y}$ to $\mathbf{x}$ as $\Delta U(\mathbf{y} \to \mathbf{x}) = U(\mathbf{x}) - U(\mathbf{y})$, we have:
$$\mathcal{L}(\theta, \mathbf{y}^+, \mathbf{y}^-) = \Delta U(\mathbf{y}^- \to \mathbf{y}^+).$$
Similarly, the Papangelou intensity can be rewritten as:
$$\lambda(u; \mathbf{y}) = \exp\left(\Delta U(\mathbf{y} \cup \{u\} \to \mathbf{y})\right).$$

Single Point Addition

Thus, for a simple negative sample $\mathbf{y}^- = \mathbf{y}^+ \cup \{u\}$, in which we add a non-valid point u to $\mathbf{y}^+$, we have:
$$\mathcal{L}(\theta, \mathbf{y}^+, \mathbf{y}^-) = \log\left(\lambda(u; \mathbf{y}^+)\right).$$
This leads to the expected behavior: minimizing the loss $\mathcal{L}$ minimizes the score of the non-valid point u. Symmetrically, for a negative sample built by removing a valid point $y \in \mathbf{y}^+$, minimizing the loss maximizes that point's score.

Arbitrary Sequence of Moves

This also holds in the generic case where $\mathbf{y}^-$ is generated from an arbitrary sequence of additions or removals of points from $\mathbf{y}^+$ (a translation/rotation/scaling can be viewed as a removal followed by an addition). This defines a sequence $(\mathbf{y}_k)_{k=0,\dots,n}$ of configurations as:
$$\forall k = 1, \dots, n, \quad \mathbf{y}_k = \begin{cases} \mathbf{y}_{k-1} \setminus \{y_k\} & \text{if } y_k \in \mathbf{y}^+ \\ \mathbf{y}_{k-1} \cup \{y_k\} & \text{otherwise,} \end{cases}$$
with $\mathbf{y}_0 = \mathbf{y}^+$, $\mathbf{y}^- = \mathbf{y}_n$, and the $y_k$ elements of either $S \times M$ or $\mathbf{y}^+$. Without loss of generality, we can reorder the sequence to match the pruning order defined in (34). The energy change for one move is given as:
$$\Delta U(\mathbf{y}_{k-1} \to \mathbf{y}_k) = \begin{cases} \log\left(\lambda(y_k; \mathbf{y}_{k-1} \setminus \{y_k\})\right) & \text{if } y_k \in \mathbf{y}^+ \\ -\log\left(\lambda(y_k; \mathbf{y}_{k-1})\right) & \text{otherwise.} \end{cases}$$
As we have (by definition) $\Delta U(\mathbf{x} \to \mathbf{x}'') = \Delta U(\mathbf{x} \to \mathbf{x}') + \Delta U(\mathbf{x}' \to \mathbf{x}'')$, the loss is given as:
$$\mathcal{L} = \underbrace{\sum_{y_k \notin \mathbf{y}^+} \log\left(\lambda(y_k; \mathbf{y}_{k-1})\right)}_{(a)} - \underbrace{\sum_{y_k \in \mathbf{y}^+} \log\left(\lambda(y_k; \mathbf{y}_{k-1} \setminus \{y_k\})\right)}_{(b)}.$$
By ordering the $y_k$ to match the pruning order in (34), each $\lambda(y_k; \cdot)$ can be matched to its respective score:
  • term (41)(a) corresponds to non-valid points added to $\mathbf{y}^+$; their score is decreased as the loss is minimized;
  • term (41)(b) corresponds to valid points removed from $\mathbf{y}^+$; their score is increased as the loss is minimized.
With this, we showed that minimization of the loss at a configuration level leads to the expected results on object scores.
While this score function allows taking into account the interaction between objects that the PP model introduces, the decomposable nature of the PP (see (5)) allows for the interpretability of the results as demonstrated in Section 3.4.

2.7. Method Summary

In this section, we proposed a model for object detection that combines CNN architectures with a Point Process (PP) framework. Before applying our approach to real data, we summarize how it operates. Considering configurations of objects $\mathbf{y}$, an image $\mathbf{X}$, and parameters θ, we look to build an energy model $U(\mathbf{y}, \mathbf{X}, \theta)$ such that its minimum-energy configuration corresponds to the ground truth configuration $\mathbf{y}^{\text{GT}}$ for that image. To this end, we build U as a combination of multiple energy terms (Equation (5)), with prior terms that encode the interaction model of our objects (where high energies penalize unwanted configurations and lower ones favor others), and data terms that we build from the features of a CNN (Section 2.3.1). We refer the reader to Figure 5 for an overall view of the model pipeline. Within this framework, we can combine the information from the image with prior knowledge about the configurations.
To estimate the parameters θ , we propose to use Contrastive Divergence. In short, we initialize the parameters at random and alternate between sampling low-energy configurations with the current parameters—that are nowhere close to the ground truth at the beginning of the procedure—and updating the parameters to increase the energy of these samples while maintaining low energy for the ground truth (Section 2.5). This allows us to estimate any differentiable parameters in the model, while previous methods would often become over-constrained and could only estimate linear parameters.
Now, given that parameters θ are set, we want to estimate the configuration for an unseen image $\mathbf{X}$. This is achieved by sampling the configuration $\mathbf{y}^*$ that minimizes the energy through a Jump Diffusion mechanism: alternating between proposing to add or remove points in the configuration, guided by the energy maps provided by the CNN (Section 2.4.1), and lowering the energy locally at a fixed number of points with gradient descent (Section 2.4.3). The latter can be applied in parallel over the image while focusing on patches of the image with higher expected object density (Section 2.4.2).
Lastly, we propose a metric based on the Papangelou intensity that allows us to compare our approach to others and takes into account the object interactions in the likelihood of detection (Section 2.6).

3. Results

3.1. Comparison to PP Based on Contrast Measures

To demonstrate the need for learned likelihood measures on our data, we evaluate a few contrast measures from the literature [20,24] on the task of classifying objects versus non-objects (Figure 8). Table 1 formulates these measures, denoting $\mu_y$, $\sigma_y^2$ the empirical mean and variance of pixels inside shape y; $\mu_s$, $\sigma_s^2$ the mean and variance on the shape contour s; $n_y$ the normal vector of contour s; and $\nabla\mathbf{X}$ the image gradient.
To assess the viability of such measures for these data, we compute a score $s(y) = \exp(-V_c(y))$ for proposed detections y (both ground truth and randomly scattered rectangles). The true objects represent 1% of this experiment's dataset (a higher proportion of true objects would allow a high average precision even for useless estimators predicting a constant value). Here we consider around 32k object proposals (i.e., 313 true objects and 31,300 non-objects). A well-suited measure $s(y)$ should allow easy classification between ground truth objects and random, non-valid objects. We plot Precision-Recall curves for several of those contrast measures in Figure 8, both for a sample image of DOTA 0.5 and a synthetic, highly contrasted image. The Average Precision (AP) values are reported in Table 1. We compare with the proposed detection energy given by $V_{\text{pos}} + V_a + V_b + V_\alpha$.
The proposed measure, based on CNN output, outperforms classical contrast-based measures, as the CNN has learned to differentiate between relevant and irrelevant contrasted objects (Irrelevant objects include, for instance, AC units—a highly contrasted, car-sized object—or some pavement features.).

3.2. Models

In this article, we show results on two CNN-based models and two PP models. A Python implementation of our models is available at https://github.com/Ayana-Inria/PP-EBM (accessed on 31 January 2024).
  • CNN-LocalMax.: Naive detection from the CNN backbone (used in the PP models); we find objects through local maxima in the output probability maps (or local minima in the energy maps).
  • CNN-PP m : PP model with minimal inferred parameters: pre-set priors (i.e., with manually set parameters), where only the weight vector $w_e$ is estimated with the CD method. The values of the manually set parameters are detailed in Table 2.
  • CNN-PP w : PP model with the whole parameter vector θ ^ (which encompasses the weights w e and priors’ parameters) estimated with the CD method.
  • BBA-Vec. and YOLOV5-OBB: Lastly, we compare all our models above with BBA-Vec. from [8] and YOLOV5-OBB from [46]. These models are trained on the same data as the above-mentioned models.

3.3. Results on Remote-Sensing Data

The models are evaluated on a test split of the DOTA 0.5 dataset (for quantitative and qualitative results) and on ADS (for qualitative results as these data are not annotated). Metrics are computed by matching the object proposals and ground truth using the IOU (Intersection Over Union) with a 0.25 threshold. The Average Precision (AP) metrics (which are threshold independent) are shown in Table 3 and are derived from Precision-Recall (PR) curves shown in Figure 9. We show detection results in Figure 10 (lines 1–2 for DOTA 0.5 and line 3 for DOTA 0.5 +noise), with a score threshold set for each model to the one maximizing the F1 score. The ADS data inference results in Figure 11 are obtained with the models trained on the DOTA 0.5 training set.

Computational Complexity

Our two PP-based methods, running on an Nvidia Quadro RTX 8000 Graphics Processing Unit, execute in 300 s on average for 16k parallel iterations (equivalent to 77k sequential iterations) on one image. With an efficient implementation, considering a density of $1.9 \times 10^{-3}$ objects per pixel (the 95th percentile of the observed object densities, i.e., 95% of observed object densities are below this value), we estimate the cost of one iteration to be around 5 operations per pixel per iteration, thus around $4 \times 10^{5}$ operations per pixel in total. We show the derivation of these values in Appendix B. As a point of comparison, Transformer models for image classification range from $5 \times 10^{6}$ to $1.8 \times 10^{7}$ operations (see [49]). The complexity could be greatly reduced by reducing the number of iterations of the Point Process sampler, at the cost of straying away from the proper convergence properties.

3.4. Results Interpretability

Due to the decomposition of the total energy into energy terms introduced in (5), the object score can be decomposed similarly:
$$s(y \mid \mathbf{y} \setminus \{y\}) = \prod_{e \in \xi} s_e(y \mid \mathbf{y} \setminus \{y\})$$
with $s_e(y \mid \mathbf{y} \setminus \{y\}) = \exp\left(w_e\, \Delta V_e(\mathbf{y} \to \mathbf{y} \setminus \{y\})\right)$ the Papangelou intensity obtained by considering the single energy term e. This allows viewing the contribution of each component. Moreover, we propose grouping these contributions into data and prior contributions to obtain, respectively, $s_{\text{data}}$ and $s_{\text{prior}}$, such that the final score is the product of the two: $s(y \mid \mathbf{y} \setminus \{y\}) = s_{\text{data}}(y)\, s_{\text{prior}}(y \mid \mathbf{y} \setminus \{y\})$.
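For illustration, a hedged sketch of splitting the score of (42) into its data and prior factors, assuming the per-term energy changes $\Delta V_e$ induced by removing the point have already been computed:

```python
import math

def decompose_score(delta_v, weights, data_terms):
    """Split the Papangelou score of Eq. (42) into data and prior contributions.

    delta_v:    dict term name -> energy change Delta V_e(y -> y \\ {y}) for the removal of the point
    weights:    dict term name -> weight w_e
    data_terms: set of data-term names (e.g. {'pos', 'a', 'b', 'alpha'})
    Returns (s_data, s_prior) with s = s_data * s_prior.
    """
    s_data, s_prior = 1.0, 1.0
    for e, dv in delta_v.items():
        s_e = math.exp(weights[e] * dv)       # per-term Papangelou factor s_e
        if e in data_terms:
            s_data *= s_e
        else:
            s_prior *= s_e
    return s_data, s_prior
```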
In Figure 12, we illustrate how the two components of the score s can help analyze the results. Here we manually cluster the results into 4 basic categories: green objects correspond to detection with high prior and data scores (high s, top 25 % ), while blue detections have a higher data contribution ( s prior < s data ). The yellow detections correspond to objects with lower data scores ( s prior > s data ), often located in ambiguous locations. Purple objects have a low total score s (i.e., low confidence, bottom 15 % ). These detections are often pruned out when using a score threshold (as used in Figure 10 and Figure 11). This allows for identifying where the model is confident, either due to the data from the image or the interactions with nearby objects. One can access even more detailed information by looking into the individual scores s e for each energy term V e .

4. Discussion

4.1. Quantitative and Qualitative Results

In Figure 10, we observe that despite noisy ground truth labels (a potential cause of over-constraining when using linear programming; see Section 2.5), our models infer more regular configurations that align better with the physical constraints of the objects of interest. In sample 1 of Figure 10, BBA-Vec. struggles to properly align and detect all the objects. In sample 2, YOLOV5-OBB produces patches of objects with no overlap regularization. Meanwhile, our model outputs regularized configurations even in these dense areas. Figure 10 (line 3) and Figure 11 showcase the robustness of our PP-based models when tested on noisy data or areas with limited information, such as shaded areas. The inference results on the ADS dataset, obtained with models trained on the DOTA 0.5 training set, highlight the model's ability to generalize to unseen, albeit similar, data.
Comparing CNN-PP m and CNN-PP w against CNN-LocalMax. in Figure 10 and Table 3 reveals the improvement brought by the PP-based approach, as the performance of the CNN alone (CNN-LocalMax.) falls short of our proposed method, especially when confronted with noisy data.
Lastly, the comparable performances of CNN-PP m and CNN-PP w , both qualitatively and quantitatively, highlight the efficacy of the CD method in inferring model parameters. This approach significantly reduces the need for manual parameter tuning (more specifically, prior parameters), where traditional PP methods, such as linear programming, fall short.

4.2. Limitations

Our method, while promising, faces several limitations that warrant consideration. First, the computational complexity involved in sampling a PP can be significant, particularly with dense configurations of objects. To address this, we could explore approximate sampling methods, although this might come at the expense of result accuracy, for instance, by implementing faster annealing techniques or discretizing the state space [48].
Second, while the model we present is tailored to a specific type of object, the underlying tools are adaptable to various object types. To begin with, the PP framework allows a wide diversity of objects to be modeled, thanks to the versatility of marks. Moreover, the energy formulation of our model allows for easy integration or removal of interaction models and priors as simple energy functions. For instance, we could incorporate classification as a mark in the Point Process model, utilizing features from the CNN or a prior law based on the geometric features of the objects (e.g., distinguishing trucks from cars based on length).
Lastly, our method relies on annotated data for training. However, energy models offer compositional flexibility; one potential avenue is to train the data terms of the model on annotated images while utilizing synthetic datasets (devoid of visual components) for prior terms training. This approach could help mitigate the reliance on annotated data and improve generalization to diverse scenarios.

5. Conclusions

In this paper, we present a method that integrates interaction models into object-detection algorithms, leveraging deep convolutional neural networks. We address challenges inherent in satellite imagery, such as atmospheric disturbances and limited resolution, by incorporating prior knowledge about object arrangements.
While CNN-based methods excel at pattern extraction, they struggle with learning object-to-object interactions without complex attention mechanisms. Conversely, Point Process methods offer a promising alternative, addressing both object likelihoods and arrangement coherence. However, existing approaches relying on contrast measures have limitations.
Our method combines CNN pattern extraction with the Point Process framework, simplifying contrast measure computations into efficient energy maps. By leveraging pre-computed potential maps, we enhance Point Process sampling methods, improving exploration efficiency within the state space.
We depart from conventional parameter estimation approaches for Point Process methods like linear programming and adopt Contrastive Divergence to estimate energy weights and internal parameters automatically. This improves accuracy and reduces reliance on manual parameter tuning.
To facilitate model comparison, we introduce a novel scoring function based on the Papangelou conditional intensity. It provides a more comprehensive evaluation metric that takes interactions into account and, thanks to the straightforward decomposition of the energy model, makes the results explainable.
Experimental results on diverse datasets demonstrate our method’s effectiveness, particularly in noisy environments. Our model’s ability to generalize to new data showcases its robustness and potential for practical applications.
In conclusion, our method offers a promising approach to object detection, combining the strengths of CNNs with Point Process models. It shows potential for improving accuracy and efficiency in satellite imagery analysis or other applications where the priors are strong: e.g., object tracking (priors on dynamics) or road extraction (geometrical priors).

Author Contributions

Conceptualization, M.O. and J.Z.; Data curation, J.M.; Formal analysis, J.M.; Funding acquisition, M.O. and J.Z.; Investigation, J.M.; Methodology, J.M., M.O. and J.Z.; Project administration, J.Z.; Resources, J.M., M.O. and J.Z.; Software, J.M.; Supervision, M.O. and J.Z.; Validation, J.M., M.O. and J.Z.; Visualization, J.M.; Writing—original draft, J.M.; Writing—review and editing, J.M., M.O. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by BPI France via the LiChIE project.

Data Availability Statement

The DOTA dataset is available at https://captain-whu.github.io/DOTA/dataset.html (accessed on 31 January 2024). The ADS data are not publicly available, due to industrial property restrictions.

Acknowledgments

Thanks to the OPAL infrastructure from Université Côte d’Azur for providing computational resources and support.

Conflicts of Interest

J. Zerubia and J. Mabon are employed by Inria. M. Ortner is employed by Airbus Defense and Space. J. Mabon’s PhD grant is part of the LiChIE contract, financed by BPI France, in partnership with Airbus Defense and Space. The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADS: Airbus Defense and Space
AP: Average Precision
CD: Contrastive Divergence
CNN: Convolutional Neural Network
EBM: Energy-Based Model
FCN: Fully Convolutional Network
GT: Ground Truth
PP: Point Process
(RJ)MCMC: (Reversible Jump) Monte Carlo Markov Chain

Appendix A. Proofs

Appendix A.1. Markovianity of the Point Process

We denote by $\sim$ the interaction relation $y \sim y' \Leftrightarrow \|y - y'\| < d_{\max}$, with associated neighborhood $\mathcal{N}_y^{\mathbf{y}}$. We need to determine the relation $\sim_m$ (with its associated neighborhood) for which the Point Process is Markovian. Given Definition 1, the dependency of the ratio $h(\mathbf{y} \cup \{u\}) / h(\mathbf{y})$ gives the minimal requirement for $\sim_m$ (i.e., the most restrictive relation that still ensures the Markovianity of the process):

$$\frac{h(\mathbf{y} \cup \{u\})}{h(\mathbf{y})} = \exp\left( U(\mathbf{y}, X) - U(\mathbf{y} \cup \{u\}, X) \right)$$
$$= \exp\left( \sum_{y \in \mathbf{y}} \left[ V\left(y, X, \mathcal{N}_y^{\mathbf{y}}\right) - V\left(y, X, \mathcal{N}_y^{\mathbf{y} \cup \{u\}}\right) \right] - V\left(u, X, \mathcal{N}_u^{\mathbf{y} \cup \{u\}}\right) \right)$$
$$= \exp\left( \sum_{y \in \mathcal{N}_u^{\mathbf{y}}} \left[ V\left(y, X, \mathcal{N}_y^{\mathbf{y}}\right) - V\left(y, X, \mathcal{N}_y^{\mathbf{y} \cup \{u\}}\right) \right] - V\left(u, X, \mathcal{N}_u^{\mathbf{y}}\right) \right).$$

The above follows from the fact that $y \notin \mathcal{N}_u^{\mathbf{y}} \Rightarrow \mathcal{N}_y^{\mathbf{y}} = \mathcal{N}_y^{\mathbf{y} \cup \{u\}}$ (as $\sim$ is symmetric). This shows that the density ratio depends on $u$ and on the first- and second-degree neighbors of $u$. Defining the relation for Markovianity as $y \sim_m y' \Leftrightarrow \exists y'' \in \mathbf{y},\ y \sim y'' \sim y'$ is impractical as it depends on $\mathbf{y}$. With $\sim$ derived from the maximum distance $d_{\max}$, the following relation is sufficient to ensure Markovianity w.r.t. $\sim_m$:

$$y \sim_m y' \Leftrightarrow \|y - y'\| < 2\, d_{\max}.$$
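This locality can also be checked numerically with a toy neighbor-dependent energy: perturbing a point farther than $2\, d_{\max}$ from $u$ leaves the ratio $h(\mathbf{y} \cup \{u\}) / h(\mathbf{y})$ unchanged, even though the individual densities change. The per-point energy below is an arbitrary stand-in (not the paper's energy); only its dependence on neighbors within $d_{\max}$ matters.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MAX = 10.0  # interaction radius of the relation ~

def neighbors(i, pts):
    """Indices of points within D_MAX of pts[i] (excluding i itself)."""
    d = np.linalg.norm(pts - pts[i], axis=1)
    return [j for j in range(len(pts)) if j != i and d[j] < D_MAX]

def point_energy(i, pts):
    """Toy per-point energy depending only on neighbors within D_MAX."""
    nb = neighbors(i, pts)
    if not nb:
        return 1.0
    return float(np.min(np.linalg.norm(pts[nb] - pts[i], axis=1))) / D_MAX

def total_energy(pts):
    return sum(point_energy(i, pts) for i in range(len(pts)))

def log_ratio(pts, u):
    """log h(y U {u}) - log h(y) = U(y) - U(y U {u})."""
    return total_energy(pts) - total_energy(np.vstack([pts, u]))

pts = rng.uniform(0, 100, size=(30, 2))
u = np.array([50.0, 50.0])

# Perturb the point farthest from u (well beyond 2*D_MAX): the ratio is unchanged.
far = np.argmax(np.linalg.norm(pts - u, axis=1))
pts_perturbed = pts.copy()
pts_perturbed[far] += 0.5

print(np.isclose(log_ratio(pts, u), log_ratio(pts_perturbed, u)))  # True
```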

Appendix A.2. Existence of Z

We denote by $f(\mathbf{y}) = \exp(-U(\mathbf{y}, X, \theta))$ the unnormalized density from which $h$ is derived in (1). The Ruelle condition [50] ensures the Point Process is stable (i.e., the density can be normalized) given the following condition:
Condition A1.
A Point Process specified by an unnormalized density $h$ w.r.t. the measure $\mu$ of the Poisson process is Ruelle-stable if there exists $M \geq 1$ such that

$$f(\mathbf{y}) \leq M^{n(\mathbf{y})}, \quad \forall \mathbf{y} \in \mathcal{Y}.$$

From the above, it follows that the unnormalized density can be bounded:

$$\int_{\mathcal{Y}} f(\mathbf{y})\, d\mu(\mathbf{y}) \leq \sum_{n=0}^{\infty} \frac{\left( M\, \nu(S) \right)^n}{n!} = \exp\left( M\, \nu(S) \right),$$

with $\mu$ the measure of the reference Poisson Point Process and $\nu$ the associated measure on the space $S$ (e.g., the Lebesgue measure).
It is enough to show the following for Condition A1 to hold:

$$f(\mathbf{y} \cup \{u\}) \leq M\, f(\mathbf{y}), \quad \forall \mathbf{y} \in \mathcal{Y}, \forall u \in S.$$

By construction, the energy of a single point is bounded (see (5) and Section 2.3.2.1):

$$\exists A > 0,\ \left| V\left(y, X, \mathcal{N}_y^{\mathbf{y}}, \theta\right) \right| < A, \quad \forall y \in \mathbf{y}, \forall \mathbf{y} \in \mathcal{Y}.$$

Then, from the results of (A3), we have:

$$\frac{f(\mathbf{y} \cup \{u\})}{f(\mathbf{y})} \leq \exp\left( 2 A\, \left| \mathcal{N}_u^{\mathbf{y}} \right| + A \right).$$

In practice, the number of points in a cell is bounded by $n_{c,\max}$; thus, the number of points in a neighborhood is bounded by $\left| \mathcal{N}_u^{\mathbf{y}} \right| \leq 4\, n_{c,\max} - 1$. We can then find a bound satisfying (A7):

$$\frac{f(\mathbf{y} \cup \{u\})}{f(\mathbf{y})} \leq \exp\left( A\, (8\, n_{c,\max} - 1) \right).$$
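A short numeric check of this chain of bounds, with illustrative (not fitted) values of $A$, $n_{c,\max}$, and $\nu(S)$:

```python
import math

A = 0.2        # illustrative bound on the per-point energy |V|
n_c_max = 2    # illustrative maximum number of points per cell
nu_S = 1.0     # illustrative measure of the image domain

# Adding a point changes the energy of at most 4*n_c_max - 1 neighbors by at
# most 2A each, plus its own term bounded by A; this gives the constant M.
M = math.exp(2 * A * (4 * n_c_max - 1) + A)
assert math.isclose(M, math.exp(A * (8 * n_c_max - 1)))  # same bound as above

# Ruelle stability then bounds the normalizing constant by exp(M * nu(S)).
Z_bound = sum((M * nu_S) ** n / math.factorial(n) for n in range(100))
print(Z_bound, math.exp(M * nu_S))  # the truncated series matches exp(M * nu(S))
```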

Appendix B. Complexity

From Algorithm 1, we derive the theoretical computational cost, in cycles, of one sampling loop over a single cell:

$$F_c \approx 3.8 \times 10^2 + 4.9 \times 10^6\, \lambda + 5.9 \times 10^8\, \lambda^2,$$

where $\lambda$ is the density of objects in the image (for details on the derivation, see [51]). To obtain the cost in operations per pixel and per iteration (ops/px/iter), we compute $F_c / d_c^2$. In practice, we use $n_s$ = 77,000 iterations of the Markov chain, so the total cost in operations per pixel (ops/px) is $n_s F_c / d_c^2$. The testing data exhibit a variety of object densities; we report the minimum, average, 95th percentile ($q_{95}$), and maximum densities in Table A2. For these density values, Table A1 gives the cost per pixel with diffusion on whole cells, with diffusion on single objects, and without diffusion. By default, the model performs diffusion over a whole cell $c$; alternatively, it can perform diffusion on a single point $y$ in cell $c$ at a lower cost. Finally, we report the cost without the diffusion mechanism (although the results are of lower quality for the same number of iterations).
Table A1. Number of theoretical operations depending on object density.

Diffusion mode | λ (a) | ops/px/iter | ops/px
With diffusion on c | λ_min | 0.2 | 1.5 × 10^4
With diffusion on c | λ_avg | 2.0 | 1.5 × 10^5
With diffusion on c | λ_q95 | 5.2 | 4.0 × 10^5
With diffusion on c | λ_max | 18.3 | 1.4 × 10^6
With diffusion on y | λ_min | 1.1 | 8.2 × 10^4
With diffusion on y | λ_avg | 1.2 | 9.2 × 10^4
With diffusion on y | λ_q95 | 1.4 | 1.1 × 10^5
With diffusion on y | λ_max | 1.9 | 1.5 × 10^5
Without diffusion | λ_min | 0.8 | 6.3 × 10^4
Without diffusion | λ_avg | 1.0 | 8.0 × 10^4
Without diffusion | λ_q95 | 1.4 | 1.0 × 10^5
Without diffusion | λ_max | 2.3 | 1.8 × 10^5
(a) Values of λ reported in Table A2.
Table A2. Values for the density λ.

λ | Value
λ_min | 1.4 × 10^−5
λ_avg | 7.9 × 10^−4
λ_q95 | 1.9 × 10^−3
λ_max | 5.2 × 10^−3
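The per-pixel costs above can be recomputed from $F_c$. The cell size $d_c$ is not restated in this appendix, so the value used below is an assumption chosen for illustration; the printed figures therefore only approximately match Table A1.

```python
def cycle_cost(lam: float) -> float:
    """Theoretical cost F_c of one sampling loop over a cell (object density lam)."""
    return 3.8e2 + 4.9e6 * lam + 5.9e8 * lam ** 2

d_c = 48        # cell size in pixels (assumption, see note above)
n_s = 77_000    # number of Markov chain iterations

# Density values taken from Table A2.
for name, lam in [("min", 1.4e-5), ("avg", 7.9e-4), ("q95", 1.9e-3), ("max", 5.2e-3)]:
    ops_px_iter = cycle_cost(lam) / d_c ** 2
    print(f"lambda_{name}: {ops_px_iter:.1f} ops/px/iter, {ops_px_iter * n_s:.1e} ops/px")
```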

Appendix C. FCN for Small Objects in Large Images

For our application, we need a Fully Convolutional Network (FCN) [52] capable of responding to small objects in large images.
Definition A1.
A Fully Convolutional Network (FCN) is a type of CNN architecture making use of locally connected layers, such as convolution, pooling, or upsampling [53].
A straightforward approach to learning a position probability map $\hat{Z}_{\text{pos}}$ would be to directly infer a heatmap of centers, obtained by applying a Gaussian blur to a binary map of object centers (Figure A1b). However, these maps are sparse (non-zero probabilities are under-represented), and at inference we observe spurious connectivity between blobs, which makes separating instances more difficult (see Figure A1b).
We circumvent this problem by learning a proxy task: the model is first trained to infer a map $H$ of 2D unit-length vectors that point towards the closest instance center (Figure A1c), similar to [54]. We then apply the divergence operator, as object centers now correspond to convergence points of this vector field [55] (Figure A1d):

$$\hat{Z}_{\text{pos}} = a \cdot \operatorname{div} \hat{H} + b.$$
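A minimal sketch of this center-extraction step, using finite differences to compute the divergence of a unit vector field (the affine coefficients $a$ and $b$ are omitted; we only locate the sink):

```python
import torch

def divergence(H: torch.Tensor) -> torch.Tensor:
    """Finite-difference divergence of a vector field H of shape (2, h, w),
    where H[0] is the i (row) component and H[1] the j (column) component."""
    di = torch.gradient(H[0], dim=0)[0]  # d H_i / d i
    dj = torch.gradient(H[1], dim=1)[0]  # d H_j / d j
    return di + dj

# Toy field pointing towards a single center at (8, 8) on a 17x17 grid.
ii, jj = torch.meshgrid(torch.arange(17.0), torch.arange(17.0), indexing="ij")
v = torch.stack([8.0 - ii, 8.0 - jj])             # vectors towards the center
H = v / (v.norm(dim=0, keepdim=True) + 1e-8)      # unit-length, as in the text
div = divergence(H)
print(div.argmin() == 8 * 17 + 8)  # strongest sink (most negative divergence) at the center
```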
The backbone architecture is a U-Net [28], which outputs, for an image $X$, a tensor $F(X)$ with $N_F$ features per pixel. From this feature representation, we extract $\hat{Z}_{\text{pos}}$ and the mark maps $\hat{Z}_{\kappa_m}$, $m \in \{a, b, \alpha\}$.
In our experiments, we train the U-Net to minimize the following cost functions for the position term [55] and the mark terms:

$$\mathcal{L}_{\text{pos}}\left(X, \mathbf{y}^{\text{GT}}\right) = \operatorname{MSE}\left(\hat{H}, H\right) + \operatorname{BCE}\left(\hat{Z}_{\text{pos}}, Z_{\text{pos}}\right),$$
$$\mathcal{L}_{\kappa}\left(X, \mathbf{y}^{\text{GT}}\right) = \frac{1}{|S_d|} \sum_{p \in S_d} \operatorname{CE}\left( \operatorname{Softmax}\left(\hat{Z}_{\kappa}[p]\right), Z_{\kappa}[p] \right),$$

with MSE the Mean Squared Error, BCE the Binary Cross Entropy, and CE the Cross Entropy. The tensors $H$, $Z_{\text{pos}}$, and $Z_{\kappa}$ are, respectively, the vector, position, and mark tensors built from the annotated ground truth.
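In PyTorch-like code, these two losses could be assembled as below; the tensor shapes and the convention for masking pixels outside $S_d$ are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def position_loss(H_hat, H, Z_pos_hat, Z_pos):
    """L_pos: MSE on the vector field plus BCE on the position map.
    Assumed shapes: H_hat, H: (B, 2, h, w); Z_pos_hat, Z_pos: (B, 1, h, w),
    with Z_pos_hat already in [0, 1]."""
    return F.mse_loss(H_hat, H) + F.binary_cross_entropy(Z_pos_hat, Z_pos)

def mark_loss(Z_kappa_hat, Z_kappa_target):
    """L_kappa: cross-entropy over discretized mark values, averaged over the
    supervised pixels S_d (marked here by a target class >= 0; -1 elsewhere).
    cross_entropy applies the softmax internally, matching CE(Softmax(.), .)."""
    return F.cross_entropy(Z_kappa_hat, Z_kappa_target, ignore_index=-1)

# Toy usage with random tensors
B, K, h, w = 2, 8, 32, 32
loss = position_loss(torch.rand(B, 2, h, w), torch.rand(B, 2, h, w),
                     torch.rand(B, 1, h, w), torch.randint(0, 2, (B, 1, h, w)).float())
loss = loss + mark_loss(torch.randn(B, K, h, w), torch.randint(-1, K, (B, h, w)))
print(loss.item())
```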
Figure A1. Closeup image with centers overlay (a), centers heatmap (b), vector field (c), and divergence (d) (red > 0, blue < 0).

References

  1. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NIPS); Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; MIT Press: Montréal, Canada, 2015; Volume 28. [Google Scholar]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  4. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  5. Li, Y.; Xing, Y.; Wang, Z.; Xiao, T.; Song, Q.; Li, W.; Wang, J. A Framework of Maximum Feature Exploration Oriented Remote Sensing Object Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6001505. [Google Scholar] [CrossRef]
  6. Yao, Y.; Cheng, G.; Wang, G.; Li, S.; Zhou, P.; Xie, X.; Han, J. On Improving Bounding Box Representations for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5600111. [Google Scholar] [CrossRef]
  7. Zhao, T.; Liu, N.; Celik, T.; Li, H.C. An Arbitrary-Oriented Object Detector Based on Variant Gaussian Label in Remote Sensing Images. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 8013605. [Google Scholar] [CrossRef]
  8. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2149–2158. [Google Scholar] [CrossRef]
  9. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-Free Oriented Proposal Generator for Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3183022. [Google Scholar] [CrossRef]
  10. Yang, Y.; Tang, X.; Cheung, Y.M.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. AR2Det: An Accurate and Real-Time Rotational One-Stage Ship Detector in Remote Sensing Images. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 5605414. [Google Scholar] [CrossRef]
  11. LaLonde, R.; Zhang, D.; Shah, M. ClusterNet: Detecting Small Objects in Large Scenes by Exploiting Spatio-Temporal Information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4003–4012. [Google Scholar] [CrossRef]
  12. Corsel, C.W.; van Lier, M.; Kampmeijer, L.; Boehrer, N.; Bakker, E.M. Exploiting Temporal Context for Tiny Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 2–7 January 2023; pp. 1–11. [Google Scholar] [CrossRef]
  13. Cheng, G.; Li, Q.; Wang, G.; Xie, X.; Min, L.; Han, J. SFRNet: Fine-Grained Oriented Object Recognition via Separate Feature Refinement. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610510. [Google Scholar] [CrossRef]
  14. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  15. Zeng, Q.; Ran, X.; Zhu, H.; Gao, Y.; Qiu, X.; Chen, L. Dynamic Cascade Query Selection for Oriented Object Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6008605. [Google Scholar] [CrossRef]
  16. Lu, X.; Sun, X.; Diao, W.; Mao, Y.; Li, J.; Zhang, Y.; Wang, P.; Fu, K. Few-Shot Object Detection in Aerial Imagery Guided by Text-Modal Knowledge. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604719. [Google Scholar] [CrossRef]
  17. Lieshout, M.C.V. Markov Point Processes and Their Applications; Imperial College Press: London, UK, 2000. [Google Scholar]
  18. Descombes, X. Multiple Objects Detection in Biological Images Using a Marked Point Process Framework. Methods 2017, 115, 2–8. [Google Scholar] [CrossRef] [PubMed]
  19. Verdié, Y.; Lafarge, F. Detecting Parametric Objects in Large Scenes by Monte Carlo Sampling. Int. J. Comput. Vis. 2014, 106, 57–75. [Google Scholar] [CrossRef]
  20. Lacoste, C.; Descombes, X.; Zerubia, J. Point Processes for Unsupervised Line Network Extraction in Remote Sensing. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1568–1579. [Google Scholar] [CrossRef] [PubMed]
  21. Ortner, M.; Descombes, X.; Zerubia, J. A Marked Point Process of Rectangles and Segments for Automatic Analysis of Digital Elevation Models. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 105–119. [Google Scholar] [CrossRef] [PubMed]
  22. Perrin, G.; Descombes, X.; Zerubia, J. Tree Crown Extraction Using Marked Point Processes. In Proceedings of the 12th European Signal Processing Conference (EUSIPCO), Vienna, Austria, 6–10 September 2004; pp. 2127–2130. [Google Scholar]
  23. Craciun, P.; Ortner, M.; Zerubia, J. Joint Detection and Tracking of Moving Objects Using Spatio-temporal Marked Point Processes. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 6–9 January 2015; pp. 177–184. [Google Scholar] [CrossRef]
  24. Kulikova, M.S.; Jermyn, I.H.; Descombes, X.; Zhizhina, E.; Zerubia, J. Extraction of Arbitrarily-Shaped Objects Using Stochastic Multiple Birth-and-Death Dynamics and Active Contours. In Proceedings of the Computational Imaging VIII, San Jose, CA, USA, 18–19 January 2010; Volume 7533, pp. 58–64. [Google Scholar]
  25. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  26. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F.J. A Tutorial on Energy-Based Learning. Predict. Struct. Data 2006, 1, 59. [Google Scholar]
  27. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar] [CrossRef]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI); Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer: Munich, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  29. Grathwohl, W.; Wang, K.C.; Jacobsen, J.H.; Duvenaud, D.; Norouzi, M.; Swersky, K. Your Classifier Is Secretly an Energy Based Model and You Should Treat It like One. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  30. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 5 March 2024).
  31. Green, P.J. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika 1995, 82, 711–732. [Google Scholar] [CrossRef]
  32. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092. [Google Scholar] [CrossRef]
  33. Grenander, U.; Miller, M.I. Representations of Knowledge in Complex Systems. J. R. Stat. Soc. Ser. B: Stat. Methodol. 1994, 56, 549–581. [Google Scholar] [CrossRef]
  34. Green, P.J.; Hastie, D.I. Reversible Jump MCMC. Genetics 2009, 155, 1391–1403. [Google Scholar]
  35. Verdié, Y.; Lafarge, F. Efficient Monte Carlo Sampler for Detecting Parametric Objects in Large Scenes. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 539–552. [Google Scholar] [CrossRef]
  36. Miller, M.; Grenander, U.; OSullivan, J.; Snyder, D. Automatic Target Recognition Organized via Jump-Diffusion Algorithms. IEEE Trans. Image Process. 1997, 6, 157–174. [Google Scholar] [CrossRef]
  37. Lafarge, F.; Gimel’farb, G.; Descombes, X. Geometric Feature Extraction by a Multimarked Point Process. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1597–1609. [Google Scholar] [CrossRef]
  38. Yu, Q.; Medioni, G. Multiple-Target Tracking by Spatiotemporal Monte Carlo Markov Chain Data Association. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 2196–2210. [Google Scholar] [CrossRef] [PubMed]
  39. Hinton, G. Training Products of Experts by Minimizing Contrastive Divergence. Neural Comput. 2002, 14, 1771–1800. [Google Scholar] [CrossRef] [PubMed]
  40. Teh, Y.W.; Welling, M.; Osindero, S.; Hinton, G.E. Energy-Based Models for Sparse Overcomplete Representations. J. Mach. Learn. Res. 2003, 4, 1235–1260. [Google Scholar]
  41. Du, Y.; Mordatch, I. Implicit Generation and Modeling with Energy Based Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  42. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  43. Bottou, L. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, 2nd ed.; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 421–436. [Google Scholar] [CrossRef]
  44. Robert, C.P.; Casella, G. Diagnosing Convergence. In Monte Carlo Statistical Methods; Robert, C.P., Casella, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 459–509. [Google Scholar] [CrossRef]
  45. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  46. Yang, X.; Yan, J. On the Arbitrary-Oriented Object Detection: Classification Based Approaches Revisited. Int. J. Comput. Vis. 2022, 130, 1340–1365. [Google Scholar] [CrossRef]
  47. Li, T.; Comer, M.; Zerubia, J. An Unsupervised Retinal Vessel Extraction and Segmentation Method Based On a Tube Marked Point Process Model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1394–1398. [Google Scholar] [CrossRef]
  48. Pham, T.T.; Hamid Rezatofighi, S.; Reid, I.; Chin, T.J. Efficient Point Process Inference for Large-Scale Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2837–2845. [Google Scholar] [CrossRef]
  49. Zhao, Y.; Wang, G.; Tang, C.; Luo, C.; Zeng, W.; Zha, Z.J. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP. arXiv 2021, arXiv:2108.13002. [Google Scholar] [CrossRef]
  50. Ruelle, D. Superstable Interactions in Classical Statistical Mechanics. Commun. Math. Phys. 1970, 18, 127–159. [Google Scholar] [CrossRef]
  51. Mabon, J. Learning Stochastic Geometry Models and Convolutional Neural Networks. Application to Multiple Object Detection in Aerospatial Data Sets. Ph.D. Thesis, Université Côte d’Azur, Nice, France, 2023. [Google Scholar]
  52. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  53. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  54. Neven, D.; Brabandere, B.D.; Proesmans, M.; Van Gool, L. Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8829–8837. [Google Scholar] [CrossRef]
  55. Mabon, J.; Ortner, M.; Zerubia, J. CNN-Based Energy Learning for MPP Object Detection in Satellite Images. In Proceedings of the IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP), Xi’an, China, 22–25 August 2022; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Example configuration with three points; $\mathbf{y} = \{y_1, y_2, y_3\}$.
Figure 2. Point Processes derived from simple energies; (a): $V_a(y) = 1$, (b): $V_b(y) = y_i$, (c): $V_c(y \mid \mathcal{N}_y^{\mathbf{y}}) = \min_{y' \in \mathcal{N}_y^{\mathbf{y}}} d(y, y')$, (d): $V_d(y \mid \mathcal{N}_y^{\mathbf{y}}) = V_b + V_c$. The horizontal axis corresponds to coordinate $i$ and the vertical axis to $j$.
Figure 3. Energy over discrete marks (a) and continuous marks (b).
Figure 4. Illustration of some priors. For priors acting on pairs of objects (zrNbr, align, ovrlp, repls), we show the energy for a pair of objects $y$ and $u$ (see (14) to (18) for the aggregation of multiple pair interactions into a single energy scalar).
Figure 5. Energy model pipeline. The pre-computed tensors can be reused for computing different $\mathbf{y} \in \mathcal{Y}$ for a given image $X$.
Figure 6. Space partitioning, where each color corresponds to a different set $s$. Each cell $c_s^z$ (square of size $d_c$) is the $z$-th cell of set $s$. For every point $y$, we display its interaction radius $d_{\max}$ as a semi-transparent circle.
Figure 7. Effect of training the energy with contrastive samples generated from the current $\theta_n$, amplified for illustrative purposes, as the gradient step that updates $\theta_n$ is small (figure adapted from [26]).
Figure 8. Comparing contrast measures on synthetic (first line) and real data (second line). Left: positive and negative samples represent, respectively, 1% and 99% of the samples for which the metric is computed. Right: Precision-Recall curves for the measures proposed in Table 1; blue: t-test [20], orange: image gradient [24], red: our data term, green: constant value.
Figure 9. Precision-Recall (PR) curves for the models listed in Table 3.
Figure 10. Samples of detection on the test dataset. The score threshold (to not display low-score objects) is set to maximize the $F_1$ score for each model. The first two lines correspond to samples from DOTA 0.5, while the third corresponds to DOTA 0.5 + noise.
Figure 11. Model applied to ADS data. The first column shows the input image, as there are no ground truth annotations for this dataset.
Figure 12. Inferred configuration on an ADS data sample (a), colored according to the prior/data scores plotted in (b): yellow $s_{\text{prior}} > s_{\text{data}}$; blue $s_{\text{prior}} < s_{\text{data}}$; purple low $s$; green high $s$. Each point in (b) corresponds to a detection in $(s_{\text{data}}, s_{\text{prior}})$ space (log values scaled to $[0, 1]$).
Table 1. Various contrast measures from the literature and Average Precision (AP) computed on the DOTA image (AP_real) or synthetic image (AP_synthetic) from Figure 8.

Measure $V_c$ | Formulation | AP_real | AP_synthetic
CNN output (ours) | $V_{\text{pos}}(y, X) + \sum_{\kappa} V_{\kappa}(y, X)$ | 0.99 | 0.88
t-test [20] | $\frac{|\mu_y - \mu_s|}{\sqrt{\sigma_y^2 / n_y + \sigma_s^2 / n_s}}$ | 0.13 | 0.99
Gradient [24] | $\int \frac{n_y(t) \cdot \nabla X(s(t))}{\sqrt{|\nabla X(s(t))|^2 + \epsilon}}\, dt$ | 0.27 | 0.99
Constant | 1.0 | 0.01 | 0.01
Table 2. Values for manually set parameters. Some energy terms, such as $V_a$, have no parameters.

Energy Term | Parameter | Value
$V_{\text{pos}}$ | $t_{\text{pos}}$ | 0
$V_a$ * | – | –
$V_b$ * | – | –
$V_\alpha$ * | – | –
$V_{\text{align}}$ | – | –
$V_{\text{ovrlp}}$ | $t_{\text{ovrlp}}$ | 0
$V_{\text{repls}}$ | $t_{\text{repls}}$ | 0
$V_{\text{attrc}}$ | $t_{\text{attrc}}$ | 0
$V_{\text{jntAR,car}}$ | $\mu_{\text{ratio}}$ | 0.46
$V_{\text{jntAR,car}}$ | $\mu_{\text{area}}$ | 42
$V_{\text{jntAR,car}}$ | $\sigma_{\text{ratio}}$ | 0.1
$V_{\text{jntAR,car}}$ | $\sigma_{\text{area}}$ | 20
$V_{\text{jntAR,truck}}$ | $\mu_{\text{ratio}}$ | 0.23
$V_{\text{jntAR,truck}}$ | $\mu_{\text{area}}$ | 123
$V_{\text{jntAR,truck}}$ | $\sigma_{\text{ratio}}$ | 0.1
$V_{\text{jntAR,truck}}$ | $\sigma_{\text{area}}$ | 20
* Mark energy terms $V_\kappa$, $\kappa \in \{a, b, \alpha\}$.
Table 3. Average Precision (AP) values. AP_0 and AP_N correspond, respectively, to the AP on the DOTA 0.5 and DOTA 0.5 + noise datasets. The precision-recall curves from which the AP is derived are shown in Figure 9.

Method | AP_0 | AP_N
BBA-Vec. | 0.82 | 0.19
YOLOV5-OBB | 0.86 | 0.10
CNN-LocalMax. | 0.86 | 0.55
CNN-PP_m | 0.91 | 0.58
CNN-PP_w | 0.92 | 0.62