Наука

Поиск

Detection of deleted frames on videos using a 3D convolutional neural network

Скачать в формате pdf

PROCEEDINGS OF SPIE

SPIEDigitalLibrary.org/conference-proceedings-of-spie

Detection of deleted frames on

videos using a 3D convolutional

neural network

V. Voronin, R. Sizyakin, A. Zelensky, A. Nadykto, I.

Svirin

V. Voronin, R. Sizyakin, A. Zelensky, A. Nadykto, I. Svirin, "Detection of

deleted frames on videos using a 3D convolutional neural network," Proc.

SPIE 10802, Counterterrorism, Crime Fighting, Forensics, and Surveillance

Technologies II, 108020U (2 November 2018); doi: 10.1117/12.2326806

Event: SPIE Security + Defence, 2018, Berlin, Germany

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018 Terms of Use: https://www.spiedigitallibrary.org/terms-of-use

Deleted frames K3, K4

Swapped frames K2, K4 Copied frames K3

Detection of deleted frames on videos using a 3D Convolutional

Neural Network

V. Voronin

, R. Sizyakin

, Zelensky

, A. Nadykto

, I. Svirin

Lab. «Mathematical methods of image processing and intelligent computer vision systems»,

Don State Technical University, Rostov-on-Don, Russian Federation

Moscow State University of Technology “STANKIN”, Moscow, Russia

CJSC Nordavind, Moscov, Russian Federation

ABSTRACT

Digital video forgery or manipulation is a modification of the digital video for fabrication, which includes frame

sequence manipulations such as deleting, insertion and swapping. In this paper, we focus on the detection problem

of deleted frames in videos. Frame dropping is a type of video manipulation where consecutive frames are deleted to

skip content from the original video. The automatic detection of deleted frames is a challenging task in digital video

forensics. This paper describes an approach using the spatial-temporal procedure based on the statistical analysis and

the convolutional neural network. We calculate the set of different statistical rules for all frames as confidence

scores. Also, the convolutional neural network used to obtain the output scores. The position of deleted frames

determines based on the two score curves for per frame clip. Experimental results demonstrate the effectiveness of

the proposed approach on a test video database.

Keywords: forgery detection, CNN, video manipulation.

1. INTRODUCTION

Currently, with the rapid development of mobile and portable video capture technology, the amount of video

material obtained with it is growing. One of the conditions for these videos is their authenticity. Digital video

forgery or manipulation is a modification of the digital video for fabrication, which includes frame sequence

manipulations such as deleting, insertion and swapping [1,2]. Frame dropping is a type of video manipulation where

consecutive frames are deleted to skip content from the original video. The automatic detection of deleted frames is

a challenging task in digital video forensics.

Most common temporal tampering in videos (Fig.1):

• Frame dropping or frame removal;

• Frame swapping;

• Frame copy or frame addition;

• Frame replacement.

Figure 1. Temporal tampering in videos.

Counterterrorism, Crime Fighting, Forensics, and Surveillance Technologies II, edited by Henri Bouma,

Radhakrishna Prabhu, Robert James Stokes, Yitzhak Yitzhaky, Proc. of SPIE Vol. 10802, 108020U

Proc. of SPIE Vol. 10802 108020U-1

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018

Since there are a lot of ways to manipulate video data, in this paper we consider only the problem of automatic

frame detection, with the estimated time gap. This gap is characterized by a sharp change of spatial information, as

well as the loss of correlation between adjacent frames.

There are several basic schemes for detection frames removal [1-8]: watermarking-based, learning-based, threshold-

based, hashing-based.

One of the first works in this field is the work [9]. In this paper, the authors, based on the inter-frame difference of

brightness histograms, find a gap in the correlation component, which could indicate the location of the proposed

gluing. Since the user sets the threshold value in a heuristic way, the method requires a large amount of test data to

select the optimal value, which is not always achievable in practice. The work [10] is a modification of the work [9],

which consists of the subsequent processing of the obtained result to reduce false alarms. The threshold value is

selected by the method proposed in [11]. After the places of the proposed gluing are found, four conditions for the

spatial blocks into which adjacent frames are divided are checked. The first assumption is that the detection was due

to the rapid movement of objects in the frame, or gluing. The second is the assumption that the background is

homogeneous and stationary and does not contain glues. The third is that the background is movable and also does

not contain glues. The last assumption is that the texture elements in the frame are also stationary and do not contain

modifications. In this paper, the authors partially got rid of the dependence of setting the threshold value for the

preliminary localization of glues. However, the user's participation is necessary to set the threshold for checking the

above assumptions. Also, it should be noted that based only on the assessment of the brightness histogram; it is not

always possible to achieve the desired result. This is because a sharp change in brightness often leads to false alarms.

In [12], the authors use a modification of the texture operator LBP [13], as well as inter-frame correlation to localize

the glues in the video sequence. The texture operator LBP allows giving robustness to the method of lighting

differences in the frame [14]. Thus, the original LBP descriptor [15] is calculated by comparing each pixel with the

Central one, which is taken as a threshold value, in a local area of 3 by 3 pixels. If the center pixel is less than or

equal to the neighbor pixel, it is set to 1, otherwise 0. The modification is to increase the radius of the pixels, which

are compared with the Central pixel, as well as the use of only nine templates, which carry the most informative

about the texture features of the image and can reduce the number of non-informative bins.

One of the main drawbacks of the described methods is that they use only one characteristic as a base, and on which

they later rely to detect time gaps in the video sequence.

The objective of our work is to develop a new approach for the detection of deleted frames on videos using a set of

statistical characteristics and the convolutional neural network.

The rest of the paper is organized as follows. The proposed action recognition method is described in section 2.

Section 3 presents some experimental results and conclusions are given in section 4.

2. PROPOSED METHOD

This paper describes a framework for digital video forgery or manipulation (see Fig. 2). We propose an approach for

the detection of deleted frames in videos. The proposed algorithm is a two-stage procedure: (a) spatial-temporal

analysis based on the statistical analysis and (b) the convolutional neural network for frame drop detection.

The algorithm of the described method is presented in Figure 3. There are several basic steps. At the training step,

the CNN takes 9-frame video clips from the dataset, and produces two outputs, “frame deleted” or “no frame

deleted”. At the testing step, we calculate the set of different statistical rules for all frames as confidence scores.

Also, the learned convolutional neural network used to obtain the output scores. Based on the two score curves, we

calculation multiplication between them and use threshold for detection deleted frames for per frame clip.

Proc. of SPIE Vol. 10802 108020U-2

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018

L__

rIr

Video sequence

9 -frame clip

Training stage

Testing stage

Calculation of

characteristics

Output scores

Confidence scores

Testing stage

Frame 1

Frame 2

Frame N

CNN

Calculation of

characteristics

Output

scores

Confidence

scores

Multiplication

.&I

Comparison

with

Output result

Figure 2. The pipeline of the proposed method.

Figure 3. The algorithm workflow.

At the first step we calculate ConfidenceScores using the set statistical characteristics for each pair of adjacent

frames 



=frgb(:,:,k-1) and 



=frgb(:,:,k):

1) The inter-frame difference:





=sum(sum(abs(



- 



)),

2) The inter-frame difference of mathematical expectations:





= abs(



- 



 





 













3) The inter-frame difference of variance:





= abs(



- 



 





 











 















4) The inter-frame difference of brightness histograms:





=sum(abs(



- 



)),

5) The correlation coefficient between the frame



and the compensated frame 



Proc. of SPIE Vol. 10802 108020U-3

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018





= corr(



,









 









































 



























 



























6) The difference between the frame



and the compensated frame 







= sum(sum(abs(



- 



))),

7) The inter-frame difference of mathematical expectation of the optical flow:





= abs(



- 



8) The inter-frame difference of variance of the optical flow:





= abs(



- 



9) The inter-frame difference of standard deviation of the optical flow:





= abs(













10) The correlation coefficient between the amplitude of the optical flow and the compensated amplitude of the

optical flow:





= corr(



,



11) The difference between the amplitude of the optical flow and the compensated amplitude of the optical flow:





= sum(sum(abs(



- 



))),

MedFilt = medfilt1(h, t),

h = (



, 



, 



, 



, 



, 



, 



, 







, 



), t = 5.





 



 



,





 



 



,





 



 



,





 



 



,





 



 



,





 



 



,





 



 



,





 



 



,





 



 



,





 



 



,























































































Proc. of SPIE Vol. 10802 108020U-4

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018

Kernels

3x3.9

@30

Convolution 1

8x8x9

Kernels

Convolution 2

3.3+ßO

@50

6x6x30

Kernels

3440

@70

Convolution 3

4x4x50

Fully Fully

connected 1 connected 2

2x2x70

1x280 1x2

































































































Next, the  is calculated as:





 



 



 









 



 









 



 









)/ 



.

The second step of the proposed algorithm is the detection of deleted frames on videos using the

Convolutional Neural Network. The CNN defines the class, to which each frame. The architecture of the neural

network is shown in Fig. 4. The model has following parameters were used to train for all experiments: size of the

mini batch equal 40, hidden convolutional layers produce 30, 50 and 70 feature maps with a kernel size of 3×3

pixels, respectively, ﬁrst fully connected layer has 280 neurons, the learning rate is 0,0001. It is important to note

that the threshold value for all experiments is 0,7. The minimum classiﬁcation error was achieved on average after

200 epochs.

Figure 4. The architecture of the proposed convolutional network.

At the final stage, the vectors 



 



 multiply element by element to form

the resulting vector 







 



 



The time of a deleted frame is detected if any of the values in the 



vector exceeds the threshold value





, which is calculated as:











  



,

  



,

Proc. of SPIE Vol. 10802 108020U-5

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018

3. EXPERIMENTAL RESULTS

To obtain the training data we received 3000 videos. The length all videos are of 1-3 minutes. Some of the frames

from this dataset presents in Figure 5. All videos deviated on two groups: light conditions or complex conditions.

The light condition is the contrast videos with light brightness and slow object motion. The videos at the group with

complex conditions include lack of contrast, irregular lighting and brightness images what may not preserve the

local image features/details. Some of the videos are stationary scene and not contains moving objects.

We randomly selected 300 videos for training and adopted the rest 50 videos for validation. We developed a tool

that randomly drops ﬁxed length frame sequences from videos. In our experiments, we manipulate each video many

different times to create more data. We vary the ﬁxed frame drop length to see how it affects detection we used 0.5s,

1s, 2s, as different frame drops durations.

Figure 5. Frames from test dataset.

To evaluate the effectiveness of the proposed method we use the following metrics:

• The probability of correct detection:

• The probability of a false alarm:

• The probability of false missing:

where TP is the true positive, FP is false positive, FN is a false negative, NumDefPix - the number of pixels

belonging to a crack, NumAllPix - total number of pixels, NumUndmgPix – the number of pixels not belonging to the

crack.

The results of calculations the probabilities shown in Table 1. The analysis of the obtained results indicates that the

efficiency of the developed method is quite high and that the use of neural networks leads to significantly reduced

probabilities of false alarms.

Proc. of SPIE Vol. 10802 108020U-6

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018

Table 1. The probabilities for detection of deleted frames on videos.

The probability of false

alarm

The probability of false

missing

The probability of correct

detection

Light conditions

3,8%

96,1%

Complex conditions

8,5%

16%

88,9%

CONCLUSIONS

We propose the approach for the detection of deleted frames in videos. The proposed algorithm is a two-stage

procedure: the statistical analysis and the convolutional neural network. We calculate the set of different statistical

rules for all frames as confidence scores. Also, the convolutional neural network used to obtain the output scores.

Based on the two score curves, we calculation multiplication between them and use threshold for detection deleted

frames for per frame clip. The proposed method can identify whether frame dropping exists and even determine the

exact location of the frame is dropping without any information of the reference/original video. Experimental results

demonstrate the effectiveness of the proposed approach to a test video database. In the future, there is planned to

apply the presented approach to the video sequence in real time, to make comparisons with state-of-the-art methods.

ACKNOWLEDGMENT

This work was supported by PROVER project (https://prover.io/).

ABN thanks the Ministry of Science and Education for support under grants 1.6198.2017/6.7 and 1.7706.2017/8.9

and Center of Collective Use of MSTU Stankin for providing resources.

REFERENCES

[1] Redi, J. A., Taktak, W., and Dugelay, J. L., "Digital image forensics: a booklet for beginners," Multimed

Tools Appl, Vol. 51(1), 133–162, (2011).

[2] Wang, W., "Digital video forensics," Ph.D. dissertation. Department of Computer Science, Dartmouth

College, Hanover, New Hampshire, (2009).

[3] Subramanyam, A.V. and Emmanuel, S., "Video forgery detection using HOG features and compression

properties," in Proc. IEEE 14th International Workshop on Multimedia Signal Processing (MMSP 2012),

89-94, (2012).

[4] Upadhyay, S. and Singh, S. K., "Learning Based Video Authentication using Statistical Local Information,"

in Proc. International Conference on Image Information Processing (ICIIP 2011), 1-6, (2011).

[5] Yu, J. and Srinath, M.D., "An efficient method for scene cut detection," Pattern Recognition Letters, 22,

1379-139, (2001).

[6] Yusoff, Y., Christmas, W., and Kittler, J., "Video Shot Cut Detection Using Adaptive Thresholding," in

Proc. British Mission Vision Conference (BMVC), 11-14, (2000).

[7] Muhammad, G., Hussain, M., and Bebis, G., "Passive copy move image forgery detection using

undecimated dyadic wavelet transform," Digital Investigation, 49–57, (2012).

[8] Chetty, G., Biswas, M., and Singh, R., "Digital Video Tamper Detection Based on Multimodal fusion of

Residue Features," in Proc. 4th International Conference on Network and System Security (NSS), 606-613,

(2010).

[9] Hong, J.Z., Kankanhalli, A., Smoliar, S.W., "Automatic partitioning of full-motion video," Multimedia

Systems, (1993).

[10] Yu, J., Srinath, M.D., "An efficient method for scene cut detection," Pattern Recognition Letters, 1379-

1391, (2001).

[11] Kapur, J.N., Sahoo, P.K., Wong, A.K.C., "A new method for gray-level picture thresholding using the

entropy of the histogram," Computer vision, graphics, and image processing, 273-285, (1985).

Proc. of SPIE Vol. 10802 108020U-7

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018

[12] Zhenzhen, Z., Jianjun, H., Qinglong, M., Zhaohong, L., "Efﬁcient video frame insertion and deletion

detection based on inconsistency of correlations between local binary pattern coded frames," Security

Comm. Networks, 311-320, (2015).

[13] Pietikäinen, M., Ojala, T., Maenpaa, T., "Multiresolution gray-scale and rotation invariant texture

classification with local binary patterns" IEEE Transactions on pattern analysis and machine intelligence,

vol. 24(7), (2002).

[14] Voronin, V., Marchuk, V., Sizyakin, R., Gapon, N., Pismenskova, M., Tokareva, S., "Automatic image

cracks detection and removal on mobile devices," Proc. SPIE 9869, Mobile Multimedia/Image Processing,

Security, and Applications, 98690R, (2016).

[15] Pietikäinen, M., Ojala, T., "Texture analysis in industrial applications," IT Advances in Image Processing,

Multimedia and Machine Vision, 337-359, (1996).

Proc. of SPIE Vol. 10802 108020U-8

Downloaded From: https://www.spiedigitallibrary.org/conference-proceedings-of-spie on 11/26/2018