SaliencyBenchmarkUAV

How well current saliency prediction models perform on UAVs videos?

Motivation

UAV imaging is now ubiquitous, which makes such content highly relevant for current and future systems and services. We identified several peculiarities of UAV videos that could change how viewers look at this content:

Bird's-eye point of view, which changes scene semantics
Size of objects
Loss of pictorial depth cues (horizon line)
Camera movements

Numerous applications could benefit from knowledge of how gaze is deployed in UAV videos, such as automated video pre-screening for surveillance. With that in mind, we asked whether current (early 2019) saliency models predict visual attention effectively on such content.

We first reviewed previously developed solutions and schemes. To the best of our knowledge, no scheme or system is dedicated to UAV imagery. We therefore examined how well existing solutions predict saliency on this new type of data.

From the state of the art, we derived a taxonomy that shows the richness of existing proposals:

Benchmark setup

The benchmark includes seven static unsupervised/hand-crafted models: BMS, RARE 2012, Hou, SIM, SUN, GBVS, and Itti. ML-Net is included as a static deep learning model. Among dynamic schemes, PQFT is a machine learning system, while ACL-Net and DeepVS are deep architectures.

In addition, we added several baselines to measure the relative gain achieved by using predictive systems. The overall, per-sequence, and all-but-sequence average saliency maps (OHM, SHM, and abSHM) examine content dependencies. The center bias represents the most likely and most prevalent visual bias in conventional imaging. We also added shuffled maps to compare the solutions against chance.
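As a rough illustration of the baselines above, here is a minimal NumPy sketch. It is not the benchmark's actual code: the Gaussian shape and `sigma_ratio` of the center-bias map, the function names, and the reading of OHM, SHM, and abSHM as plain frame averages (over all sequences, one sequence, and all sequences but one) are assumptions made for this example.

```python
import numpy as np

def center_bias_map(height, width, sigma_ratio=0.25):
    """Isotropic Gaussian centered on the frame (sigma_ratio is an assumed value)."""
    ys = np.arange(height)[:, None] - (height - 1) / 2.0
    xs = np.arange(width)[None, :] - (width - 1) / 2.0
    sigma = sigma_ratio * min(height, width)
    g = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()  # normalize so the map sums to 1

def average_maps(maps_by_sequence):
    """Average human saliency maps at three levels.

    maps_by_sequence: dict mapping a sequence name to an array of shape
    (frames, H, W). At least two sequences are needed for the
    all-but-sequence average.
    """
    # Per-sequence average (assumed reading of SHM).
    shm = {name: maps.mean(axis=0) for name, maps in maps_by_sequence.items()}
    # Average over every frame of every sequence (assumed reading of OHM).
    ohm = np.concatenate(list(maps_by_sequence.values()), axis=0).mean(axis=0)
    # Average over every frame except those of the held-out sequence
    # (assumed reading of abSHM).
    ab_shm = {}
    for name in maps_by_sequence:
        others = [m for n, m in maps_by_sequence.items() if n != name]
        ab_shm[name] = np.concatenate(others, axis=0).mean(axis=0)
    return ohm, shm, ab_shm
```

Comparing a model's score against the abSHM of the same sequence gives a sense of how much of its performance could be explained by content-independent regularities alone.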

We benchmarked these solutions on the EyeTrackUAV1 dataset, the only UAV video dataset with gaze data available at that time. Typical saliency metrics were used to assess the models: Correlation Coefficient (CC), Similarity (SIM), Area Under the Curve (AUC) in its Judd and Borji variants, Normalized Scanpath Saliency (NSS), and Kullback-Leibler divergence (KL).
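For readers unfamiliar with these metrics, the sketch below gives common textbook formulations of CC, SIM, NSS, and KL in NumPy. It is an illustration only, not the evaluation code used for the benchmark: the `EPS` constant and the normalization choices are assumptions, maps are assumed non-negative, and the AUC variants are omitted for brevity.

```python
import numpy as np

EPS = 1e-12  # assumed small constant to avoid division by zero

def _as_distribution(m):
    """Scale a non-negative map so that it sums to 1."""
    m = np.asarray(m, dtype=np.float64)
    return m / (m.sum() + EPS)

def _standardize(m):
    """Zero-mean, unit-variance version of a map."""
    m = np.asarray(m, dtype=np.float64)
    return (m - m.mean()) / (m.std() + EPS)

def cc(pred, gt):
    """Pearson correlation between predicted and ground-truth saliency maps."""
    return float((_standardize(pred) * _standardize(gt)).mean())

def sim(pred, gt):
    """Histogram intersection of the two maps seen as distributions."""
    return float(np.minimum(_as_distribution(pred), _as_distribution(gt)).sum())

def nss(pred, fixations):
    """Mean standardized prediction at fixated pixels (binary fixation map)."""
    mask = np.asarray(fixations).astype(bool)
    return float(_standardize(pred)[mask].mean())

def kl(pred, gt):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p, g = _as_distribution(pred), _as_distribution(gt)
    return float((g * np.log(EPS + g / (p + EPS))).sum())
```

CC, SIM, and KL compare a prediction with a continuous ground-truth saliency map, whereas NSS takes the binary fixation map. Higher is better for all but KL.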

Results

Key takeaway messages:

- Static and dynamic deep learning models, trained on conventional contents, show the most promising results
  - Potential of higher performance after fine-tuning or training
- No significant difference between static and dynamic deep learning schemes
  - Video characteristics (angle, distance to the scene, environment type and object size) uninformative
  - Event-related annotations promising
- UAV-specificities and dynamic challenges
  - Need to study center-bias and other attention bias
  - Need to develop video-based metrics
  - Need to create larger UAV datasets

To go further:

Model efficiency may be related to events such as reframing, which brings objects of interest back to the center of the frame. Here is an example of temporal CC results on the sequence wakeboard10. Around frame 250, after reframing, the metric scores are higher.

To illustrate this, we added a transparent colored filter over the sequence that represents the CC results for ACL-Net. Blue indicates a poor correlation between the prediction and the ground truth; red indicates a high predictive score.
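The temporal analysis and color coding described above can be approximated with the following sketch. It is a hypothetical reconstruction, not the code behind the figures: the function names, the linear blue-to-red ramp, and its `lo`/`hi` bounds are assumptions, and the event frame (for example, around 250 for wakeboard10) must be supplied by hand.

```python
import numpy as np

def frame_cc(pred, gt, eps=1e-12):
    """Pearson correlation between one predicted and one ground-truth map."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    g = (gt - gt.mean()) / (gt.std() + eps)
    return float((p * g).mean())

def temporal_cc(pred_seq, gt_seq):
    """CC for every frame of a sequence; inputs have shape (frames, H, W)."""
    return np.array([frame_cc(p, g) for p, g in zip(pred_seq, gt_seq)])

def mean_before_after(scores, event_frame):
    """Average score before and from a manually annotated event frame."""
    return float(scores[:event_frame].mean()), float(scores[event_frame:].mean())

def score_to_rgb(score, lo=0.0, hi=1.0):
    """Map a score to an (r, g, b) tint: blue at lo, red at hi (assumed ramp)."""
    t = float(np.clip((score - lo) / (hi - lo), 0.0, 1.0))
    return (t, 0.0, 1.0 - t)
```

Blending each frame with `score_to_rgb(scores[i])` at a fixed opacity would give an overlay comparable in spirit to the one described, and `mean_before_after` offers a quick numerical check of whether scores rise after a reframing event.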

We also question the suitability of the center bias for all UAV videos, given the weak performance of existing saliency models and of the center bias baseline. Here are some examples that qualitatively illustrate our doubts.

For more results on biases, see our work on UAV Biases.

To read the details or cite this work, please refer to the following:

Perrin, A. F., Zhang, L., & Le Meur, O. (2019, September). How well current saliency prediction models perform on UAVs videos?. In International Conference on Computer Analysis of Images and Patterns (pp. 311-323). Springer, Cham.