High Speed Single-Stage Face Detector using Depthwise Convolution and Receptive Fields

At present, face detectors use large Convolutional Neural Networks (CNNs), a widely used technique in artificial intelligence, to achieve high detection performance. These face detectors have a large number of parameters, which severely reduces their detection speed on systems with low computational resources. Achieving good performance and high detection speed with limited computational power is a challenging problem. In this paper, we propose a single-stage, end-to-end trained face detector to address this problem. The computational cost is reduced by using depthwise convolution and by swiftly reducing the size of the input image. The early layers of the model use CReLU (Concatenated Rectified Linear Unit) activations to preserve information and generate more representative features of the input. Receptive Field (RF) blocks used in the model improve the detection performance. The proposed model is 1.7 Megabytes in size and achieves 42 FPS (Frames Per Second) on a CPU (i5-8330H) and 179 FPS on a GPU (GTX1060). The model is evaluated on benchmark datasets, namely WIDER FACE, PASCAL Face and AFW, and achieves good performance compared to other state-of-the-art methods.
Keywords: Artificial intelligence; computer-vision; Convolutional Neural Network (CNN); face detector


I. INTRODUCTION
Face detection is defined as the problem of detecting and localizing faces in a given image. It is a basic and long-standing problem of active research in computer vision. Applications such as face recognition, face tracking and face hallucination use face detection as a primary and essential preprocessing step. Many practical systems for facial analysis, surveillance and biometrics require fast and accurate face detection.
There are two challenging problems encountered in face detection. The first is classifying faces with a large variety of facial appearances against a complex background. The second is detecting faces of different sizes at different positions in a given image. Both problems are related to the computational cost and speed of face detection, and it is a challenging task to develop a face detector that balances the two. A further problem is that the boundary of an object may be blurred by the imaging system [1], [2].
Face detection methods can be broadly divided into two categories: traditional methods and CNN based methods. Traditional methods are very fast but less accurate. These methods use hand-crafted features to train classifiers. Viola-Jones [3] and Deformable Part Models (DPM) [4] are good examples of traditional methods that combine good speed with decent accuracy. The performance of these detectors decreases in unconstrained environments, mainly because the hand-crafted features are not robust.
CNN based methods can achieve high performance at the cost of speed. Their significant improvement in the accuracy of face detection has diverted researchers' attention towards CNN based face detectors. CNN models achieve high performance by using a large number of convolutional layers, which are also responsible for the slow speed of the detector. For example, some recent high-performing face detectors such as DSFD [5], PyramidBox [6] and RetinaFace [7] use large CNN models such as VGG-16 [8] and ResNet-152 [9]. These CNN models have a large number of parameters: VGG-16 has about 138 million parameters and ResNet-152 has about 60 million. The CNN methods [5], [6], [7] are slow and hence not suited to many practical systems. Cascade CNNs [10], [11] can be used to improve the detection speed, but these detectors suffer from two limitations. First, each stage of the cascade is trained and optimized separately, which makes training difficult and also affects performance. Second, the running time of the detector grows with the number of faces in an image.
In this paper, a lightweight, single-stage, end-to-end trained face detector with high speed and good accuracy is proposed. The proposed method can be divided into two networks: a backbone network, which extracts features from input images, and a detection network, which localizes the faces. The backbone network uses depthwise separable convolution with large strides to swiftly reduce the dimensions of the input. Instead of the max-pooling layers used in [12], the proposed model uses depthwise separable convolution to reduce the size, because it adds extra feature layers and hence provides a better feature representation. CReLU [13] activations are used to preserve information while the size of the input is reduced using large strides. The detection network consists of Receptive Field (RF) blocks followed by depthwise convolution layers. Feature maps from the RF blocks are used for detection.
The main contributions of this paper can be summarized as follows: (1) A new lightweight backbone design is proposed to overcome the drawbacks of previous methods. (2) A new lightweight face detection method is proposed by integrating the backbone network with an RF-based detection network for fast and accurate face detection. (3) The proposed method performs better than other methods. (4) Experiments performed on CPU and GPU hardware show that the proposed method is suitable for practical systems. The rest of the paper is organized as follows. Section II contains a brief review of available CNN based face detectors and of the techniques used in the proposed method. Section III presents the proposed method and explains its framework and implementation details. Results obtained from experiments are discussed in Section IV, followed by the conclusion in Section V.

II. RELATED WORK

A. CNN based Face Detectors
Almost all modern face detectors use CNN architectures. CNN based face detectors can be classified into three categories, i.e., cascade face detectors, region-based face detectors and single-stage face detectors.
Cascade face detectors divide the detection task among more than one CNN. The CNN cascade structure introduced in [10] consists of six CNNs, three for classification and three for calibration. In this architecture, each classification network is followed by a calibration network. MTCNN [11] reduced the number of networks to three by integrating the classification and calibration tasks into one network. The first network, called P-Net, proposes facial regions. The later two networks, R-Net and O-Net, refine the proposals. The authors of [14] divided P-Net into six sub-networks to detect faces at multiple scales, which improves detection performance for tiny faces. The detection performance of cascade face detectors is further improved by adding extra information about facial parts [14], [15]. In the cascade framework, the first network proposes the facial regions and subsequent networks process these regions. This makes the speed of the detectors dependent on the number of faces in the image, which is a major limitation of these detectors.
Region-based and single-stage detectors are also known as two-stage and one-stage detectors, respectively. Both kinds of detectors were developed for generic object detection and were later modified for face detection. Region-based detectors have two stages: the first stage generates object proposal regions using proposal generators, and the second stage estimates the precise location and class of the object. R-CNN based face detectors [16], [17] use RPNs (Region Proposal Networks) [18]. The performance of this approach is further improved by CMS-RCNN [19], which adds contextual information. Region-based detectors use large CNNs for the second stage. This leads to high detection accuracy, but the processing speed of the framework becomes slow.
Single-stage detectors eliminate the region proposal stage and use a single stage to make predictions. These detectors are computationally efficient compared to region-based detectors but have lower detection accuracy. Single-stage face detectors are inspired by generic object detectors such as YOLO [20] and SSD [21]. These detectors have attracted more researchers because of their high detection speed, and different architectures [5], [6], [7] have been proposed recently. The lightweight CNN architectures of [12], [22] use inception modules and CReLU activation, and also propose an anchor densification strategy to improve recall. LFFD [23] proposes an anchor-free lightweight model that uses Receptive Fields (RF) as natural anchors for detection. The number of model parameters was reduced to 0.1 million in [24] by integrating the image pyramid with the CNN and using weight sharing. Still, there is considerable room for improving the processing speed without sacrificing detection accuracy.

B. Receptive Field (RF) and Dilation
Receptive Fields (RF) in CNNs are inspired by the human visual system. In the visual system, the RF of a neuron is the particular area of the retina to which it responds. Similarly, in a CNN each neuron has an RF that responds to a particular area of the input [25]. In other words, the RF defines the local region of an image to which the neuron will respond. The area of the RF is determined by the kernel size used in the convolution layer. The RF has two important properties: first, each neuron in a CNN has a unique activation for a given image region and, second, pixels around the centre of the RF have a large impact on the activation. The impact of neighbouring pixels can be represented as a Gaussian distribution [23] and is known as the ERF (Effective RF). The RF also helps in detection by adding contextual information to the network.
The RF of a CNN can be increased by adding convolution layers, by depthwise convolution or by using dilated convolutions [26]. Adding convolution layers (increasing the depth of the network) increases the computational cost, so dilated and depthwise convolutions are a more effective way to increase the RF. Dilated convolution was introduced in [27] as atrous convolution. It is very similar to a conventional convolution layer, except that there are gaps between the kernel values, whose size is decided by the dilation rate. The authors of [12], [23] used RF and dilated convolution for face detection.
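The effect of the dilation rate on the RF can be illustrated with a small calculation. The sketch below is not taken from the proposed model: the function names and the example layer stacks are assumptions chosen for illustration, and it relies on the standard relation that a kernel of size k with dilation rate d covers k + (k - 1)(d - 1) input positions.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of convolution layers.

    `layers` is a sequence of (kernel_size, stride, dilation) tuples,
    listed from the input towards the output (hypothetical example format).
    """
    rf = 1    # a single output unit initially sees one input pixel
    jump = 1  # distance, in input pixels, between adjacent units of a layer
    for kernel_size, stride, dilation in layers:
        k_eff = effective_kernel_size(kernel_size, dilation)
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf


# Two stacked 3x3 convolutions, without and with dilation in the second layer.
plain = receptive_field([(3, 1, 1), (3, 1, 1)])
dilated = receptive_field([(3, 1, 1), (3, 1, 2)])
print(plain, dilated)
```

Under these assumptions, setting the dilation rate of the second layer to 2 enlarges the RF while the number of kernel weights stays the same, which is the motivation given above for preferring dilation over extra depth.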

C. Depthwise Separable CNN
Many state-of-the-art CNN architectures [28], [29] use depthwise separable convolution layers. The depthwise separable convolution layer is computationally more efficient than a standard convolutional layer. The standard convolution layer performs the convolution operation on the input volume and combines the generated features in one step. The computational cost of standard convolution is $D_k \cdot D_k \cdot M \cdot N \cdot D_F \cdot D_F$ [28], where $D_F$ and $D_k$ are the spatial dimension of the input feature map and the kernel size, respectively, and $M$ and $N$ are the number of channels in the input feature map and the number of convolution filters, respectively. To reduce the computational cost, the one-step process is divided into two steps by using factorized convolution, also known as depthwise separable convolution.
The first step is a depthwise convolution, performed on each channel of the input feature map separately. Two assumptions are made in this step: (1) the number of convolution filters is equal to the number of channels of the input feature map, and (2) the spatial sizes of the input and output feature maps are the same. If depthwise convolution is performed on an input feature map of size $D_F \times D_F \times M$ using filters of spatial size $D_k \times D_k$, then $D_k \times D_k \times M \times D_F \times D_F$ multiplication operations are performed in this step.
The second step is pointwise convolution: a $1 \times 1$ convolution is performed across the $M$ channels of the depthwise convolution output. This helps to gain cross-channel information and linearly combines the output. If $N$ filters of dimension $1 \times 1$ are applied to the $D_F \times D_F \times M$ depthwise convolution output, then $D_F \times D_F \times M \times N$ multiplication operations are performed. Therefore, the computational cost of depthwise separable convolution is $D_k \cdot D_k \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$. For comparison, consider an image of size $100 \times 100 \times 3$ passed through a depthwise separable convolution layer and a standard convolution layer. If $N = 10$, standard convolution performs $2.7 \times 10^{7}$ operations while depthwise separable convolution performs $5.7 \times 10^{6}$ operations, which is approximately 4.7 times fewer than standard convolution.
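The two cost expressions can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the 3 × 3 kernel, stride of 1 and equal input and output size are assumptions, since the kernel size is not stated for the example above.

```python
def standard_conv_cost(d_k: int, d_f: int, m: int, n: int) -> int:
    """Multiplications of a standard convolution: D_k * D_k * M * N * D_F * D_F."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k: int, d_f: int, m: int, n: int) -> int:
    """Multiplications of the depthwise step plus the pointwise (1 x 1) step."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


# Assumed setting: 100 x 100 x 3 input, N = 10 filters, 3 x 3 kernel.
std = standard_conv_cost(d_k=3, d_f=100, m=3, n=10)
sep = depthwise_separable_cost(d_k=3, d_f=100, m=3, n=10)
ratio = std / sep
print(std, sep, round(ratio, 2))
```

With the assumed 3 × 3 kernel, the ratio comes out at about 4.7, in agreement with the ratio quoted above. The absolute counts under this assumption (2.7 × 10^6 and 5.7 × 10^5) are a factor of ten lower than the quoted figures, so only the ratio should be read as confirmed by this sketch.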

III. PROPOSED METHOD
In this section, the overall framework of the proposed model is introduced, followed by a detailed description of model training.

A. Overall Framework of Proposed Model
The proposed face detector can be divided into two networks, i.e., a backbone network and a detection network, as shown in Fig. 1. The backbone network is designed to swiftly reduce the dimensions of the input images without losing information in the process. It consists of five convolution blocks. The first and third blocks are standard convolution layers with CReLU activation and large strides. The second, fourth and fifth blocks are depthwise separable convolution blocks. CReLU is used for its information-preserving reconstruction property, which improves the feature reconstruction power of the CNN [13]. CReLU activation is applied by concatenating the linear response of the CNN layer with its negation and passing the result through ReLU activation, as shown in Fig. 2(a). Mathematically, it is defined as:

$$\rho_c(x) = \big(\max(x, 0),\ \max(-x, 0)\big) \qquad (1)$$

where $\rho_c : \mathbb{R} \to \mathbb{R}^2$ is the CReLU activation and $x$ is the linear response of the CNN layer. From equation (1), it can easily be deduced that CReLU activation preserves both the negative and the positive response. Hence the CReLU scheme produces representative features of the input data [13]. To reduce the computational complexity, depthwise separable convolution blocks are used. These blocks consist of depthwise convolution followed by batch normalization and ReLU activation, as shown in Fig. 2(b).
The features obtained from the backbone network are fed into the detection network for further processing. The detection network is based on the cascade structure of SSD [21]. The model uses features from the RF blocks, which decrease in spatial size but have an increasing receptive field. Feature maps from different layers form a multi-scale feature map to handle faces of variable sizes. RF-block-1, RF-block-2 and RF-block-3 are associated with anchor boxes to detect faces of small, medium and large sizes, respectively. The multi-layer, multi-branch RF-blocks use different kernels and dilation rates. This design has the advantage of classifying faces (with facial variation) from a complex background.
The RF-blocks consist of a bottleneck structure and a residual connection, as in [30], [31]. The first layer of the multi-branch design is a 1 × 1 convolution, used to reduce the number of channels in the feature maps. Then, to reduce the computational cost, 3 × 1 and 1 × 3 convolutions are used. To increase the non-linearity and the effective receptive field, depthwise separable convolutions with different dilation rates are used. The increased non-linearity generates a more robust feature representation of the input. The increased effective receptive field helps to capture more contextual information for accurate classification. The branches are concatenated and a shortcut path is added to the result. Fig. 2(d) shows the detailed architecture of the RF-block. Fig. 2(b) and Fig. 2(c) show the architectural view of the convolution and depthwise separable convolution blocks used in the RF-blocks. In the model, each convolution layer is followed by batch normalization and ReLU activation. This is done to reduce overfitting, induce sparsity and handle the vanishing gradient problem.
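The CReLU operation described in this subsection can be illustrated on a plain list of responses. This is a minimal sketch rather than the layer used in the model: the function names and the example values are assumptions, and a real implementation would concatenate along the channel axis of a feature map.

```python
def relu(value: float) -> float:
    """Rectified linear unit for a single value."""
    return max(value, 0.0)


def crelu(responses):
    """Concatenate the responses with their negation, then apply ReLU.

    The output is twice as long as the input: the first half keeps the
    positive parts, the second half keeps the magnitudes of the negative parts.
    """
    concatenated = list(responses) + [-r for r in responses]
    return [relu(v) for v in concatenated]


features = crelu([1.5, -2.0, 0.5])
print(features)
```

In this example the response -2.0, which a plain ReLU would map to zero, survives as 2.0 in the second half of the output. This is the sense in which both the positive and the negative response are preserved, at the cost of doubling the number of output channels.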

B. Implementation Details
The model uses anchors with a 1:1 aspect ratio and the densification strategy of [12]. The anchor scales are 32, 64 and 128 pixels for RF block-1, 256 pixels for RF block-2 and 512 pixels for RF block-3. The model is trained on the WIDER FACE [32] training set. This set consists of 12880 training images with different sizes, occlusion and blurriness levels. The training data is prepared by removing extremely small faces (height or width less than 15 pixels) and heavily blurred or occluded faces. For data augmentation, strategies such as random cropping, horizontal flipping, scale transformation and colour distortion are used during training. During training, an anchor box is matched to a ground truth bounding box if their Jaccard index is more than 0.40. The multi-box loss objective function [21] is used for training. It is a weighted sum of the cross-entropy loss for bounding box confidence and the smooth L1 loss for bounding box coordinate regression. It is defined as:

$$L(c_i, d_i) = L_{cls}(c_i, c_i^*) + \lambda\, L_{reg}(d_i, d_i^*)$$

where $L(c_i, d_i)$ is the multi-box loss for the $i$-th bounding box with confidence score $c_i$ and coordinates $d_i$. $L_{cls}(c_i, c_i^*)$ is the cross-entropy loss between the predicted confidence score $c_i^*$ of the bounding box and the ground truth confidence score $c_i$. $L_{reg}(d_i, d_i^*)$ is the smooth L1 loss between the predicted and ground truth bounding box coordinates. $\lambda$ is a hyperparameter used to balance the two losses ($\lambda = 2$ is used for training the network). The model is trained with a batch size of 32 for 280 thousand iterations. The SGD optimizer used in training has a momentum of 0.9 and a weight decay of $5 \times 10^{-4}$. The learning rate is set to $10^{-3}$, $10^{-4}$ and $10^{-5}$ for 160K, 80K and 40K iterations, respectively. The model is implemented using the PyTorch framework.
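The matching criterion and the weighted-sum loss can be sketched for a single anchor as follows. This is an illustrative sketch and not the training code of the model: the function names, the (x_min, y_min, x_max, y_max) box format, the binary face/background cross-entropy and the choice to apply the regression term only to matched anchors (the usual SSD convention) are all assumptions.

```python
import math


def jaccard_index(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def smooth_l1(predicted, target):
    """Smooth L1 loss summed over the box coordinates."""
    total = 0.0
    for p, t in zip(predicted, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def cross_entropy(face_probability, is_face):
    """Binary cross-entropy between a predicted face probability and its label."""
    eps = 1e-12  # guards against log(0)
    p = min(max(face_probability, eps), 1.0 - eps)
    return -math.log(p) if is_face else -math.log(1.0 - p)


def multibox_loss(face_probability, is_face, predicted, target, lam=2.0):
    """Classification loss plus lam times the regression loss (lam = 2 as in the text)."""
    cls = cross_entropy(face_probability, is_face)
    reg = smooth_l1(predicted, target) if is_face else 0.0
    return cls + lam * reg


# An anchor is treated as matched when its Jaccard index exceeds 0.40.
anchor, ground_truth = (0, 0, 10, 10), (0, 0, 10, 5)
matched = jaccard_index(anchor, ground_truth) > 0.40
loss = multibox_loss(0.5, matched, [0.5, 2.0], [0.0, 0.0])
print(matched, round(loss, 4))
```

In practice the loss is accumulated over all anchors in a batch and normalized, typically by the number of matched anchors; that step is left out here to keep the per-anchor structure visible.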

IV. RESULTS AND DISCUSSION
In this section, the proposed face detection algorithm is evaluated on benchmark datasets, followed by a speed comparison with available lightweight models.

A. Experimental Setup
The proposed method is implemented using PyTorch version 1.6.0 on a system with an i5-8330H @ 2.30 GHz processor, 16 Gigabytes of RAM and an NVIDIA GTX 1060 GPU (Graphics Processing Unit).

B. Evaluation on Benchmark Datasets
The proposed algorithm is evaluated on three benchmark face detection datasets: WIDER FACE [32], Pascal Face [33] and AFW [33], [34]. The proposed method is compared with other state-of-the-art lightweight detectors using the Average Precision (AP) percentage metric and PR (Precision-Recall) curves.
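To make the AP metric concrete, the sketch below computes a PR curve and its area from a ranked list of detections. It is a simplified illustration and not the evaluation protocol of any of the benchmarks: the function names and the toy detections are assumptions, each detection is assumed to have already been labelled as a true or false positive, and the official toolkits apply their own matching and interpolation rules.

```python
def precision_recall(scores, is_true_positive, num_ground_truth):
    """Precision and recall after each detection, in order of decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, index in enumerate(order, start=1):
        if is_true_positive[index]:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the PR curve using step-wise (rectangle) integration."""
    area = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Toy example: four detections for an image set with four annotated faces.
scores = [0.9, 0.8, 0.7, 0.6]
flags = [True, False, True, True]
precisions, recalls = precision_recall(scores, flags, num_ground_truth=4)
ap = average_precision(precisions, recalls)
print(round(100 * ap, 1))
```

A detector that ranks true faces above false alarms and misses few faces obtains a PR curve that stays close to the top right corner, and therefore a higher AP.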

1) WIDER FACE Dataset: The dataset contains a total of 32203 images of faces with different poses, scales, facial expressions and illumination. The dataset contains a training set and a validation set. The validation set has three subsets, easy, medium and hard, based on the difficulty level of face detection. The proposed method is trained on the training set and validated on all three validation subsets. It is compared against baseline methods [3], [4], [35], [36] and other methods [11], [12], [22], [24], [37], [38], [39]. Table I shows the performance comparison of the proposed method with the other methods. The proposed method shows better results on the easy and medium validation subsets and comparable results on the hard subset. This could be because the network was trained only on faces with a height or width greater than 15 pixels, and heavily occluded and blurred faces were removed from the training set. Fig. 3 shows the PR curves of the proposed method compared against the baseline methods.
2) AFW and PASCAL Face Dataset: The AFW dataset is a collection of 205 Flickr images with 473 face annotations. Table II shows the performance comparison of the proposed method with standard methods [4], [12], [33], [41], [42] using the mAP (%) metric. The proposed method shows better performance than the other methods. Fig. 4(a) shows the PR curves for the proposed method, standard methods and commercial face detectors (Face.com, Face++ and Picasa).
The Pascal Face dataset is formed from the PASCAL person layout dataset. It contains 851 images with 1335 face annotations. The comparison of the proposed method with standard methods [3], [4], [12], [33], [34], [42] is given in Table II. Fig. 4(b) shows the PR curves of the proposed method, standard methods and commercial methods. The proposed method shows better results on this dataset.

V. CONCLUSION
This paper introduces a fast and high-performing face detector. The high processing speed is achieved by using a lightweight backbone network. This feature extractor rapidly reduces the size of the input without losing information in the process. The information is retained by the CReLU activation function. The detection performance is achieved by efficiently utilizing the feature maps obtained from the feature extractor. The detector has RF blocks that imitate the human visual system. As the results suggest, the proposed method works well on images containing faces with a height of more than 15 pixels. The model is limited in detecting tiny faces and heavily occluded faces. The proposed model can be compressed further using CNN optimization techniques such as pruning. The experiments performed with the proposed face detector on benchmark datasets have shown good results and high processing speeds on both CPU and GPU devices.