Detection of deleted frames on videos using a 3D Convolutional Neural Network

V. Voronin (a), R. Sizyakin (a), A. Zelensky (b), A. Nadykto (b), I. Svirin (c)

(a) Lab. «Mathematical methods of image processing and intelligent computer vision systems», Don State Technical University, Rostov-on-Don, Russian Federation
(b) Moscow State University of Technology “STANKIN”, Moscow, Russia
(c) CJSC Nordavind, Moscow, Russian Federation
ABSTRACT
Digital video forgery or manipulation is a modification of the digital video for fabrication, which includes frame sequence manipulations such as deletion, insertion and swapping. In this paper, we focus on the problem of detecting deleted frames in videos. Frame dropping is a type of video manipulation in which consecutive frames are deleted to skip content from the original video. The automatic detection of deleted frames is a challenging task in digital video forensics. This paper describes an approach based on a spatial-temporal procedure that combines statistical analysis and a convolutional neural network. We calculate a set of statistical characteristics for all frames as confidence scores. In addition, a convolutional neural network is used to obtain output scores. The position of deleted frames is determined from these two score curves for each frame clip. Experimental results demonstrate the effectiveness of the proposed approach on a test video database.
Keywords: forgery detection, CNN, video manipulation.
1. INTRODUCTION
Currently, with the rapid development of mobile and portable video capture technology, the amount of video material obtained with such devices is growing. One of the key requirements for these videos is their authenticity. Digital video forgery or manipulation is a modification of the digital video for fabrication, which includes frame sequence manipulations such as deletion, insertion and swapping [1,2]. Frame dropping is a type of video manipulation in which consecutive frames are deleted to skip content from the original video. The automatic detection of deleted frames is a challenging task in digital video forensics.
The most common types of temporal tampering in videos are (Fig. 1):
- frame dropping or frame removal;
- frame swapping;
- frame copy or frame addition;
- frame replacement.
Figure 1. Temporal tampering in videos.
Counterterrorism, Crime Fighting, Forensics, and Surveillance Technologies II, edited by Henri Bouma,
Radhakrishna Prabhu, Robert James Stokes, Yitzhak Yitzhaky, Proc. of SPIE Vol. 10802, 108020U
© 2018 SPIE · CCC code: 0277-786X/18/$18 · doi: 10.1117/12.2326806
Since there are many ways to manipulate video data, in this paper we consider only the problem of automatically detecting deleted frames by estimating the resulting time gap. This gap is characterized by a sharp change in spatial information, as well as a loss of correlation between adjacent frames.
There are several basic schemes for detecting frame removal [1-8]: watermarking-based, learning-based, threshold-based and hashing-based.
One of the first works in this field is [9]. In that paper, the authors use the inter-frame difference of brightness histograms to find a gap in the correlation component, which could indicate the location of a possible splice. Since the user sets the threshold value heuristically, the method requires a large amount of test data to select the optimal value, which is not always achievable in practice. The work [10] is a modification of [9] that adds post-processing of the obtained result to reduce false alarms. The threshold value is selected by the method proposed in [11]. After candidate splice locations are found, four conditions are checked for the spatial blocks into which adjacent frames are divided. The first assumption is that the detection was caused by rapid movement of objects in the frame or by a splice. The second is that the background is homogeneous and stationary and does not contain splices. The third is that the background is moving and also does not contain splices. The last assumption is that the texture elements in the frame are also stationary and do not contain modifications. In that work, the authors partially removed the dependence on the threshold value for the preliminary localization of splices. However, user participation is still necessary to set the thresholds for checking the above assumptions. It should also be noted that, relying only on the brightness histogram, it is not always possible to achieve the desired result, because a sharp change in brightness often leads to false alarms.
In [12], the authors use a modification of the LBP texture operator [13], as well as inter-frame correlation, to localize splices in the video sequence. The LBP texture operator makes the method robust to lighting variations in the frame [14]. The original LBP descriptor [15] is calculated by comparing each pixel in a local 3-by-3 neighborhood with the central one, which is taken as a threshold value. If the central pixel is less than or equal to the neighboring pixel, the corresponding bit is set to 1, otherwise to 0. The modification consists of increasing the radius of the pixels that are compared with the central pixel, as well as using only nine templates that carry the most information about the texture features of the image, which reduces the number of non-informative bins.
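To make the basic operator concrete, the following minimal Python sketch (assuming an 8-bit grayscale image stored as a NumPy array; the neighbor ordering and function name are illustrative, not taken from [12] or [15]) computes the 3x3 LBP code for each interior pixel.

```python
import numpy as np

def lbp_3x3(gray: np.ndarray) -> np.ndarray:
    """Basic 3x3 LBP: each neighbor >= center contributes one bit of the code."""
    gray = gray.astype(np.int32)
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = gray[1:-1, 1:-1]
    # Clockwise neighbor offsets starting from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbor >= center).astype(np.uint8) << bit)
    return codes
```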
One of the main drawbacks of the described methods is that they rely on a single characteristic to detect temporal gaps in the video sequence.
The objective of our work is to develop a new approach for the detection of deleted frames in videos using a set of statistical characteristics and a convolutional neural network.
The rest of the paper is organized as follows. The proposed detection method is described in Section 2. Section 3 presents experimental results, and conclusions are given in Section 4.
2. PROPOSED METHOD
This paper describes a framework for detecting digital video forgery or manipulation (see Fig. 2). We propose an approach for the detection of deleted frames in videos. The proposed algorithm is a two-stage procedure: (a) spatial-temporal analysis based on statistical characteristics and (b) a convolutional neural network for frame-drop detection.
The workflow of the described method is presented in Figure 3. There are several basic steps. At the training step, the CNN takes 9-frame video clips from the dataset and produces two outputs, “frame deleted” or “no frame deleted”. At the testing step, we calculate the set of statistical characteristics for all frames as confidence scores. In addition, the trained convolutional neural network is used to obtain the output scores. Based on the two score curves, we compute their element-wise product and apply a threshold to detect deleted frames for each frame clip.
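For concreteness, a small sketch of how the 9-frame clips could be assembled for the CNN is given below; the one-frame sliding step and the stacking of frames along the first axis are assumptions, since the paper does not state how the clips are sampled.

```python
import numpy as np

def make_clips(frames: np.ndarray, clip_len: int = 9) -> np.ndarray:
    """Slice a (N, H, W) grayscale video into overlapping clips of clip_len frames.

    Returns an array of shape (N - clip_len + 1, clip_len, H, W); each clip
    covers one candidate deletion point (an assumed convention).
    """
    n = frames.shape[0]
    return np.stack([frames[i:i + clip_len] for i in range(n - clip_len + 1)])

# Example: a 100-frame video of 64x64 frames yields 92 nine-frame clips
clips = make_clips(np.zeros((100, 64, 64), dtype=np.uint8))
print(clips.shape)  # (92, 9, 64, 64)
```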
[Figure 2/3 diagram: the video sequence is split into 9-frame clips; the training stage trains the CNN, while the testing stage computes, for frames 1..N, the CNN output scores and the statistical confidence scores, multiplies them, and compares the result with a threshold to produce the output result.]
Figure 2. The pipeline of the proposed method.
Figure 3. The algorithm workflow.
At the first step we calculate ConfidenceScores using a set of statistical characteristics for each pair of adjacent frames f1 = frgb(:,:,k-1) and f2 = frgb(:,:,k):

1) The inter-frame difference:
D1 = sum(sum(abs(f2 - f1))),

2) The inter-frame difference of mathematical expectations:
D2 = abs(M(f2) - M(f1)), where M(f) is the mean value of the frame f,

3) The inter-frame difference of variances:
D3 = abs(V(f2) - V(f1)), where V(f) is the variance of the frame f,

4) The inter-frame difference of brightness histograms:
D4 = sum(abs(H(f2) - H(f1))), where H(f) is the brightness histogram of the frame f,

5) The correlation coefficient between the frame f2 and the motion-compensated frame f1c:
D5 = corr(f2, f1c),

6) The difference between the frame f2 and the motion-compensated frame f1c:
D6 = sum(sum(abs(f2 - f1c))),

7) The inter-frame difference of the mathematical expectations of the optical flow:
D7 = abs(M(OF2) - M(OF1)),

8) The inter-frame difference of the variances of the optical flow:
D8 = abs(V(OF2) - V(OF1)),

9) The inter-frame difference of the standard deviations of the optical flow:
D9 = abs(S(OF2) - S(OF1)), where S(OF) = sqrt(V(OF)),

10) The correlation coefficient between the amplitude of the optical flow A2 and the compensated amplitude of the optical flow A1c:
D10 = corr(A2, A1c),

11) The difference between the amplitude of the optical flow and the compensated amplitude of the optical flow:
D11 = sum(sum(abs(A2 - A1c))).

The characteristic values are then smoothed with a one-dimensional median filter:
MedFilt = medfilt1(h, t), h = (D1, D2, ..., D11), t = 5.
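As a concrete illustration, the sketch below computes several of these characteristics for one pair of adjacent grayscale frames with NumPy and OpenCV. The Farneback flow estimator and all function names are assumptions for illustration: the paper does not specify the flow estimator or the motion-compensation scheme, so the compensation-based characteristics (D5, D6, D10, D11) are omitted here.

```python
import cv2
import numpy as np

def flow_amplitude(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Dense optical-flow amplitude between two grayscale frames (Farneback)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2)

def pair_characteristics(prev_gray: np.ndarray, curr_gray: np.ndarray) -> dict:
    """A subset of the per-pair characteristics (D1-D4) plus flow statistics."""
    p = prev_gray.astype(np.float64)
    c = curr_gray.astype(np.float64)
    hp, _ = np.histogram(prev_gray, bins=256, range=(0, 256))
    hc, _ = np.histogram(curr_gray, bins=256, range=(0, 256))
    amp = flow_amplitude(prev_gray, curr_gray)
    return {
        "D1": np.sum(np.abs(c - p)),      # inter-frame difference
        "D2": abs(c.mean() - p.mean()),   # difference of means
        "D3": abs(c.var() - p.var()),     # difference of variances
        "D4": np.sum(np.abs(hc - hp)),    # difference of brightness histograms
        "flow_mean": amp.mean(),          # compared across pairs for D7
        "flow_var": amp.var(),            # compared across pairs for D8
        "flow_std": amp.std(),            # compared across pairs for D9
    }
```

The per-pair values collected over the whole sequence would then be median-filtered as described above.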
[Figure 4 diagram: an 8x8x9 input clip is processed by Convolution 1 (3x3 kernels, 30 feature maps, output 6x6x30), Convolution 2 (3x3 kernels, 50 feature maps, output 4x4x50) and Convolution 3 (3x3 kernels, 70 feature maps, output 2x2x70), followed by Fully connected 1 (1x280) and Fully connected 2 (1x2).]
Next, ConfidenceScores is calculated for each frame clip by combining the characteristics D1, ..., D11, normalized by the median-filtered value MedFilt.
The second step of the proposed algorithm is the detection of deleted frames in videos using the convolutional neural network. The CNN determines the class to which each frame clip belongs. The architecture of the neural network is shown in Fig. 4. The following parameters were used to train the model in all experiments: a mini-batch size of 40; hidden convolutional layers producing 30, 50 and 70 feature maps, respectively, with a kernel size of 3x3 pixels; a first fully connected layer with 280 neurons; and a learning rate of 0.0001. It is important to note that the threshold value for all experiments is 0.7. The minimum classification error was achieved on average after 200 epochs.
Figure 4. The architecture of the proposed convolutional network.
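A minimal PyTorch sketch of a network with these layer sizes is given below. It assumes, following Fig. 4, that each 9-frame clip enters as an 8x8 patch with the 9 frames stacked along the channel axis; the class name and any details beyond the parameters listed above are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameDropCNN(nn.Module):
    """Sketch of the Fig. 4 layout: 3 conv layers (30/50/70 maps, 3x3 kernels,
    no padding) followed by 2 fully connected layers (280 -> 2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(9, 30, kernel_size=3),   # 8x8x9  -> 6x6x30
            nn.ReLU(),
            nn.Conv2d(30, 50, kernel_size=3),  # 6x6x30 -> 4x4x50
            nn.ReLU(),
            nn.Conv2d(50, 70, kernel_size=3),  # 4x4x50 -> 2x2x70
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 2*2*70 = 280
            nn.Linear(280, 280),               # first fully connected layer
            nn.ReLU(),
            nn.Linear(280, 2),                 # "frame deleted" / "no frame deleted"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a mini-batch of 40 clips, each an 8x8 patch with 9 frame channels
scores = FrameDropCNN()(torch.randn(40, 9, 8, 8))  # shape: (40, 2)
```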
At the final stage, the ConfidenceScores and OutputScores vectors are multiplied element by element to form the resulting vector R:
R = ConfidenceScores .* OutputScores.
The time of a deleted frame is detected if any of the values in the R vector exceeds the threshold value T (0.7 in our experiments).
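As an illustration of this fusion step, the following NumPy sketch (with hypothetical variable names) multiplies the two per-clip score curves and reports the clip indices flagged as containing a frame drop, using the 0.7 threshold quoted above.

```python
import numpy as np

def detect_frame_drops(confidence_scores: np.ndarray,
                       output_scores: np.ndarray,
                       threshold: float = 0.7) -> np.ndarray:
    """Element-wise fusion of the statistical and CNN score curves.

    confidence_scores, output_scores: one value per 9-frame clip.
    Returns the indices of clips whose fused score exceeds the threshold.
    """
    fused = confidence_scores * output_scores    # element-wise product
    return np.flatnonzero(fused > threshold)     # flagged clip indices

# Example with made-up scores for five clips
print(detect_frame_drops(np.array([0.2, 0.9, 0.4, 0.95, 0.1]),
                         np.array([0.1, 0.85, 0.3, 0.9, 0.2])))  # -> [1 3]
```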
3. EXPERIMENTAL RESULTS
To obtain the training data, we collected 3000 videos. All videos are 1-3 minutes long. Some frames from this dataset are presented in Figure 5. All videos are divided into two groups: light conditions and complex conditions. The light-condition group contains high-contrast videos with good brightness and slow object motion. The complex-condition group includes videos with low contrast, irregular lighting and brightness variations that may not preserve local image features/details. Some of the videos show a stationary scene and contain no moving objects.
We randomly selected 300 videos for training and used the remaining 50 videos for validation. We developed a tool that randomly drops fixed-length frame sequences from videos (a sketch of such a tool is given after Figure 5). In our experiments, we manipulate each video multiple times to create more data. To see how the frame-drop length affects detection, we used 0.5 s, 1 s and 2 s as frame-drop durations.
Figure 5. Frames from test dataset.
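A minimal sketch of such a frame-dropping tool is given below, using OpenCV; the function name, codec and re-encoding settings are illustrative assumptions rather than details from the paper.

```python
import random
import cv2

def drop_frames(src_path: str, dst_path: str, drop_seconds: float = 1.0) -> None:
    """Copy a video while deleting one randomly placed run of consecutive frames."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    drop_len = int(round(drop_seconds * fps))        # e.g. 0.5 s, 1 s or 2 s
    start = random.randint(1, max(1, total - drop_len - 1))

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))
    for idx in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        if start <= idx < start + drop_len:          # skip the deleted run
            continue
        writer.write(frame)
    cap.release()
    writer.release()
```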
To evaluate the effectiveness of the proposed method, we use three metrics: the probability of correct detection, the probability of a false alarm, and the probability of a missed detection. They are computed from the numbers of true positives (TP), false positives (FP) and false negatives (FN).
The calculated probabilities are shown in Table 1. The analysis of the obtained results indicates that the efficiency of the developed method is quite high and that the use of the neural network significantly reduces the probability of false alarms.
Table 1. The probabilities for detection of deleted frames on videos.

Condition            Probability of false alarm   Probability of missed detection   Probability of correct detection
Light conditions     3.8%                          3.8%                              96.1%
Complex conditions   8.5%                          16%                               88.9%
CONCLUSIONS
We propose an approach for the detection of deleted frames in videos. The proposed algorithm is a two-stage procedure combining statistical analysis and a convolutional neural network. We calculate a set of statistical characteristics for all frames as confidence scores. In addition, the convolutional neural network is used to obtain the output scores. Based on the two score curves, we compute their element-wise product and apply a threshold to detect deleted frames for each frame clip. The proposed method can identify whether frame dropping has occurred and even determine the exact location of the dropped frames without any information about the reference/original video. Experimental results demonstrate the effectiveness of the proposed approach on a test video database. In the future, we plan to apply the presented approach to video sequences in real time and to compare it with state-of-the-art methods.
ACKNOWLEDGMENT
This work was supported by PROVER project (https://prover.io/).
ABN thanks the Ministry of Science and Education for support under grants 1.6198.2017/6.7 and 1.7706.2017/8.9
and Center of Collective Use of MSTU Stankin for providing resources.
REFERENCES
[1] Redi, J. A., Taktak, W., and Dugelay, J. L., "Digital image forensics: a booklet for beginners," Multimed
Tools Appl, Vol. 51(1), 133-162, (2011).
[2] Wang, W., "Digital video forensics," Ph.D. dissertation. Department of Computer Science, Dartmouth
College, Hanover, New Hampshire, (2009).
[3] Subramanyam, A.V. and Emmanuel, S., "Video forgery detection using HOG features and compression
properties," in Proc. IEEE 14th International Workshop on Multimedia Signal Processing (MMSP 2012),
89-94, (2012).
[4] Upadhyay, S. and Singh, S. K., "Learning Based Video Authentication using Statistical Local Information,"
in Proc. International Conference on Image Information Processing (ICIIP 2011), 1-6, (2011).
[5] Yu, J. and Srinath, M.D., "An efficient method for scene cut detection," Pattern Recognition Letters, 22,
1379-1391, (2001).
[6] Yusoff, Y., Christmas, W., and Kittler, J., "Video Shot Cut Detection Using Adaptive Thresholding," in
Proc. British Machine Vision Conference (BMVC), 11-14, (2000).
[7] Muhammad, G., Hussain, M., and Bebis, G., "Passive copy move image forgery detection using
undecimated dyadic wavelet transform," Digital Investigation, 49-57, (2012).
[8] Chetty, G., Biswas, M., and Singh, R., "Digital Video Tamper Detection Based on Multimodal fusion of
Residue Features," in Proc. 4th International Conference on Network and System Security (NSS), 606-613,
(2010).
[9] Hong, J.Z., Kankanhalli, A., Smoliar, S.W., "Automatic partitioning of full-motion video," Multimedia
Systems, (1993).
[10] Yu, J., Srinath, M.D., "An efficient method for scene cut detection," Pattern Recognition Letters, 1379-
1391, (2001).
[11] Kapur, J.N., Sahoo, P.K., Wong, A.K.C., "A new method for gray-level picture thresholding using the
entropy of the histogram," Computer vision, graphics, and image processing, 273-285, (1985).
[12] Zhenzhen, Z., Jianjun, H., Qinglong, M., Zhaohong, L., "Efficient video frame insertion and deletion
detection based on inconsistency of correlations between local binary pattern coded frames," Security
Comm. Networks, 311-320, (2015).
[13] Pietikäinen, M., Ojala, T., Maenpaa, T., "Multiresolution gray-scale and rotation invariant texture
classification with local binary patterns," IEEE Transactions on pattern analysis and machine intelligence,
vol. 24(7), (2002).
[14] Voronin, V., Marchuk, V., Sizyakin, R., Gapon, N., Pismenskova, M., Tokareva, S., "Automatic image
cracks detection and removal on mobile devices," Proc. SPIE 9869, Mobile Multimedia/Image Processing,
Security, and Applications, 98690R, (2016).
[15] Pietikäinen, M., Ojala, T., "Texture analysis in industrial applications," IT Advances in Image Processing,
Multimedia and Machine Vision, 337-359, (1996).