Camera model identification based on forensic traces extracted from homogeneous patches
1. Introduction
An intriguing question in digital image forensics is the following: given an image, is it possible to identify the source camera that was used to capture it? This question is of prime importance to Law Enforcement Agencies (LEAs) when investigating digital image and video evidence. The last two decades have seen rapid growth in the usage of the Internet, which has unfortunately also led to an increase in the circulation of illicit content involving minors, especially on darknets. As part of their investigations, LEAs aim to identify the source of such content, since knowing the camera that captured it provides additional intelligence for building stronger cases against suspected offenders. Source Camera Identification (SCI) is an important low-level problem in the field of computer vision and plays a crucial role in the forensic investigation of digital images. The methods of SCI can also be used in related applications such as image forgery detection (Bondi et al., 2017, Cozzolino and Verdoliva, 2019, Li et al., 2014) for fighting fake news, and image integrity verification (Li et al., 2009) for verifying digital evidence presented to courts of law, among others. Our work is part of the EU-funded 4NSEEK project, which aims to develop forensic tools that help LEAs fight child sexual abuse.
In this work, we aim to identify the source camera model of an image by examining only the pixel values. Clues for identifying the source camera can also be gathered from the metadata stored in the Exchangeable Image File Format (EXIF) header. However, this information can be modified when the image is re-saved in a different format or re-compressed, and any tampering with the EXIF header cannot be detected. This makes EXIF information unreliable for the task of SCI, and we therefore avoid using it. In contrast, when the pixel values of an image are altered, research has shown that it is possible to detect such tampering. For example, it is possible to identify common image processing operations such as median filtering (Kang et al., 2013, Kirchner and Fridrich, 2010), Gaussian filtering (Fan et al., 2015, Kang and Wei, 2008), and JPEG image compression (Kang and Wei, 2008, Luo et al., 2010, Wang and Zhang, 2016), among others. Relying on the pixel values therefore makes SCI more robust than a system that depends on image metadata.
Another approach for SCI involves the examination of digital watermarks. While this is a very effective approach, in practice, such watermarks are not embedded into images by most consumer-grade cameras. Digital watermarking is mainly used for identifying copyright infringements in commercial digital media, but it has limited applicability for SCI.
With smartphones becoming a common commodity, and image editing tools gaining popularity, it is becoming even more challenging to identify cameras from images. Operations such as image resizing, filtering, and compression, among others, not only tamper with the sensor pattern noise but also leave behind operation-specific traces that are embedded into the image. While the presence of these additional traces makes SCI more difficult, it is still possible to detect them (Fan et al., 2015, Kang et al., 2013, Kang and Wei, 2008, Kirchner and Fridrich, 2010, Luo et al., 2010, Wang and Zhang, 2016). Detection of image forgery is an important forensic task, but it falls beyond the scope of this work, as we consider only unedited images in this study.
It is necessary here to clarify what is meant by camera-model and camera-device identification, the two main approaches to SCI. In model identification, the task is to identify the specific camera model that was used to capture an image (for example, an iPhone 10, an iPhone 11, etc.). We refer to the manufactured instances of the same camera model as devices. Device identification is more challenging and is also of high interest to forensic investigators. A prerequisite for good device identification performance is a method that first performs very well at model identification. In this work, we focus on camera model identification.
The key contributions of our work are threefold and can be summarized as follows. Firstly, we observed that during the extraction of camera features from an image, it is important to ensure that the features correspond to the processing noise rather than the scene details; this is all the more essential when employing neural networks for feature extraction. In order to prevent any bias due to the depicted scenes, we decided to work only with homogeneous patches. Most images contain such regions, and we propose a method to identify and extract homogeneous patches from an image. The proposed patch selection ensures that the learning process is not influenced by the scene content of the images. Secondly, we propose a systematic hierarchical scheme for data balancing that is well suited to the problem of camera identification. Finally, we propose a hierarchical approach to classification, where we perform brand classification at the first level, followed by the classification of camera models at the second level (refer to Fig. 1). We empirically show that this hierarchical scheme is more effective than non-hierarchical classification for SCI, among other advantages. We also share the source code (see footnote 1) for further dissemination of our approach and experiments.
The rest of the paper is organized as follows: The following section describes related methods and summarizes the state-of-the-art for SCI. The proposed methodology is described in Section 3. Section 4 elucidates the systematic experiments performed along with the obtained results. The discussion on our experiments is presented in Section 5. Finally, the main conclusions are drawn in Section 6.
2. Related works
Ever since digital cameras gained popularity, SCI has become an important research problem for the forensic analysis of digital media. Camera identification from images is possible due to artefacts introduced during the generation of a digital image. Fig. 2 captures the high-level pipeline for image generation inside a digital camera, which involves a sequence of hardware and software processing steps. The light rays from the scene enter the camera through a set of lenses, followed by an anti-aliasing filter, a colour filter array (CFA), and finally the imaging sensor. The sensor converts the analog signal to digital and produces a RAW image. This image is then further processed, typically involving steps such as demosaicing, gamma correction, and compression, before the final image is generated.
The exact implementation of these processing steps varies for each camera model and leaves behind a unique processing trace during the generation of a digital image. Furthermore, two camera devices of the same camera model also generate slightly different processing noise. This intra-model variance is due to the unique random noise generated by each imaging sensor in every device. Though challenging to extract, the noise generated by the sensor is what makes camera device identification possible. Note that we use the terms camera traces, artefacts, and processing noise interchangeably, as they refer to the same thing.
The final image can therefore be considered as a combination of scene details and in-camera processing noise. An overview of the classification of this noise is shown in Fig. 3. The camera noise consists of sensor noise, CFA artefacts, compression artefacts, and image post-processing artefacts, among others. Apart from the sensor noise, all other components are deterministic and, when combined, can identify the specific camera model. Lukas et al. (2006) further classified the sensor noise into shot noise and pattern noise. Shot noise is an additive random component and can be removed by frame averaging. The pattern noise is a multiplicative deterministic component that is unique to each camera device. The pattern noise further consists of fixed pattern noise (FPN) and photo-response non-uniformity (PRNU) noise. FPN is an additive component caused by dark currents, i.e. the signal produced when the sensor is not exposed to any light. PRNU is a multiplicative noise caused by the varying sensitivity of individual pixels when illuminated by a light source; this component is primarily responsible for device identification. For an elaborate classification of sensor noise we refer the reader to Lukas et al. (2006).
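Following Chen et al. (2008), this multiplicative behaviour is often summarized by a simplified sensor output model,
\[
I = I^{(0)} + I^{(0)} K + \Theta,
\]
where \(I^{(0)}\) denotes the noise-free scene signal, \(K\) the multiplicative PRNU factor, and \(\Theta\) collects all remaining noise terms, such as shot noise and dark current.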
One of the earliest experiments was conducted by Kurosawa et al. (1999), in which the FPN present in dark frames was studied. The authors examined several camera devices across different models and showed that the FPN is unique to every camera device; this uniqueness is attributed to imperfections in the silicon wafer fabrication process. As the generation of a dark frame requires access to the physical camera device, Lukas et al. (2006) proposed a method to extract the PRNU noise directly from natural images. In their work, a high-pass filter was used to extract the camera fingerprints, which were in turn compared using correlation measures to determine their source identity. Further methods based on PRNU were presented by Chen et al. (2008), Li (2010), Lin and Li (2015), and Rosenfeld and Sencar (2009).
Methods for camera identification have also been proposed that aim to extract the artefacts introduced by a specific set of processing operations during image generation inside a camera. Kharrazi et al. (2004) extracted handcrafted features and used an SVM for the classification of images; these features were designed to target a specific set of forensic traces. Several approaches specifically target the demosaicing artefacts related to the colour filter array (Bayram et al., 2005, Cao and Kot, 2009, Chen and Stamm, 2015, Swaminathan et al., 2007). Some approaches use non-trainable features (Xu & Shi, 2012), which cannot be adapted to newer camera models, while others assume a particular camera noise model (Thai et al., 2013). All these methods extract features by assuming that the artefacts were mainly introduced by a subset of processing steps; such methods are said to be based on a closed set of forensic traces. They require expertise and effort to design feature descriptors for a specific processing step, and their drawback is that they might miss non-intuitive forensic traces, for example those that arise from a combination of processing steps. In contrast, we do not make any assumption about the specific source of artefacts and therefore consider all possible forensic traces. Methods that, like ours, make no assumptions about the source of forensic traces are said to be based on an open set of forensic traces; they are more promising for extracting a wide variety of camera fingerprints.
Recently, methods based on deep learning have been presented to address a variety of image forensic tasks. These methods overcome the problems associated with hand-crafted feature engineering and are better suited to extracting an open set of forensic traces. Image forensic tasks recently approached with deep learning include the detection of image in-painting (Wang et al., 2019, Zhu et al., 2018), median filtering (Chen et al., 2015, Tang et al., 2018), resizing (Bunk et al., 2017), and JPEG compression (Barni et al., 2017, Barni et al., 2016), among others. Though these methods are based on deep learning, they are characterized by a closed set of forensic traces. We are, however, interested in methods that use deep learning and are based on an open set of forensic traces (Bayar and Stamm, 2016, Mayer et al., 2018).
Bondi et al. (2016) showed that Convolutional Neural Networks (ConvNets) can be used to extract forensic traces from images to identify source camera models. They performed experiments on the publicly available Dresden data set (Gloe & Böhme, 2010) and showed that ConvNet-based methods achieve state-of-the-art performance when compared to other approaches. A similar study using ConvNets was also performed by Tuama et al. (2016). Bayar and Stamm (2018b) proposed a system based on an open set of forensic traces, which indicates whether an image was captured by a camera model seen during training or by one outside the training set.
Every image generated by a digital camera consists of scene details along with the camera processing noise that is embedded into the image. This raises an important question: can we extract camera traces from an image while ignoring the scene details? This is a challenge, especially in the context of ConvNets, where we allow the model to learn its own feature extractor. In order to prevent the ConvNet from learning high-level features, some methods have been proposed (Bayar and Stamm, 2018a, Timmerman et al., 2020) that suppress the scene content as a pre-processing step. These methods, however, do not fully prevent the ConvNet from learning high-level scene details. We address this issue by extracting patches only from image regions that are homogeneous. Further details of our patch filtering and selection strategy are presented in Section 3.
We further note that training a hierarchy of classifiers is better suited to SCI than training a single classifier: training a single highly effective classifier becomes challenging as the number of camera brands and models grows. To overcome this, we propose the hierarchical classification scheme shown in Fig. 1; a more detailed account of this scheme is presented in the following section.
3. Methodology
3.1. Overview

This section describes our proposed approach, in which we use hierarchical classification based on deep learning for source camera model identification, together with the data preparation steps necessary for our pipeline. A high-level overview of the proposed methodology is presented in Fig. 4. The pipeline consists of three major steps. Firstly, we determine the homogeneous regions in an image: the input image is divided into overlapping blocks of size 128 × 128 pixels, which are then subjected to the proposed homogeneity criteria in order to filter out non-homogeneous patches. Secondly, a trained ConvNet determines the camera brand for each of the homogeneous patches, and these patch-level predictions are combined with a majority vote to determine the camera brand. Finally, based on the determined brand, the corresponding model-level classifier is applied to each of the homogeneous patches, and a majority vote over the patch-level predictions determines the source camera model of the given input image.

3.2. Homogeneous patch selection and extraction

As described earlier, every image consists of both scene details and camera processing noise. Image regions can be classified into regions with high-level scene details (that is, regions containing edges, high texture details, etc.) and regions with low-level scene details (that is, homogeneous regions, where neighbouring pixels have almost similar intensity values). The latter type of region contains camera processing noise that is least distorted by high-level scene details. We now describe the methodology to extract homogeneous patches from such regions.

In order to extract patches, we tile the input image with blocks of size 128 × 128 pixels. Using a larger block size, such as 256 × 256 pixels, would result in fewer blocks, and the chances of a block being homogeneous are reduced. Using a smaller block size, such as 32 × 32 or 64 × 64 pixels, would result in more homogeneous patches; however, extracting camera-specific features from smaller patches is more difficult. To account for this trade-off, we sample the input image with a tile of size 128 × 128 pixels and a stride of 0.25 times the block size. This substantially increases the number of extracted image regions while retaining a sufficiently large block size. Each block is then examined for homogeneity by computing the standard deviation of its pixel values. Since there are three colour channels, we determine three standard deviations, one per channel. Fig. 5 depicts a sample input image in part (a) and the corresponding standard deviations of each image block in part (b); for the sake of clarity, only the standard deviations of the red colour channel are shown, using non-overlapping blocks (a stride of one block). The standard deviation values of homogeneous regions are smaller than those of other regions. We subjectively determined a high threshold with a value of 0.02 to exclude non-homogeneous patches with significant scene-level details; this threshold was set after manually examining hundreds of extracted patches. We also set a low threshold of 0.005 to eliminate saturated patches. A saturated patch is one that suffers a true loss of data because many of its pixel values are clipped to the maximum intensity (i.e. 255), thereby overriding the camera noise. This occurs when the incoming light intensity at the sensor is very high, or when certain operations (e.g. denoising) are applied by editing software; in the latter case, such operations may produce regions where the difference between neighbouring pixels is close to zero. Finally, for a patch to be considered homogeneous, all three standard deviations must independently adhere to the threshold limits defined above. Sample saturated, homogeneous, and non-homogeneous patches are shown in Fig. 5(c–e).

In our experiments, described in Section 4, we determine the number of homogeneous patches to be extracted from each image as follows. Suppose a given number n of patches needs to be extracted. If the number of homogeneous patches in an image is more than n, then we uniformly sample n patches so that they are evenly distributed among the homogeneous regions across the whole image. On the other hand, when there are fewer than n homogeneous patches, we choose all of them and select the remaining patches from the saturated and non-homogeneous patches with the lowest standard deviations. Thus, homogeneous regions are given priority during patch extraction. Finally, we subtract the per-colour-channel mean from the respective colour channel. This minimizes the colour information being inadvertently learnt for classification. Fig. 6 illustrates this pre-processing step and demonstrates that it reduces the effect of brightness in the input patches, so that the classifier focuses on the noise, which is our matter of interest. Fig. 7 depicts the ratio of the average number of homogeneous to non-homogeneous to saturated patches for each camera model in the Dresden data set. Further details concerning the camera models are presented in Table 1.

The proposed patch selection approach enables us to process images of different dimensions. In contrast, methods that use a full image (Bennabhaktula et al., 2020) may have to resize it first, which can lead to unintended modifications of the forensic traces. Moreover, choosing patches gives us the ability to select regions that are least affected by the scene content.
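For illustration, the following minimal Python sketch implements the patch selection just described. It assumes the input is an H x W x 3 RGB array scaled to [0, 1], so that the thresholds 0.005 and 0.02 apply directly to the per-channel standard deviations; the exact tie-breaking rules may differ from our released implementation.

```python
import numpy as np

BLOCK = 128              # patch size in pixels
STRIDE = BLOCK // 4      # 0.25 times the block size
LOW, HIGH = 0.005, 0.02  # saturation and homogeneity thresholds

def classify_blocks(img):
    """Label every candidate block as homogeneous, saturated, or other."""
    h, w, _ = img.shape
    blocks = []
    for y in range(0, h - BLOCK + 1, STRIDE):
        for x in range(0, w - BLOCK + 1, STRIDE):
            stds = img[y:y + BLOCK, x:x + BLOCK].reshape(-1, 3).std(axis=0)
            if (stds > HIGH).any():
                label = 'non_homogeneous'   # significant scene detail
            elif (stds < LOW).any():
                label = 'saturated'         # clipped or denoised region
            else:
                label = 'homogeneous'       # all three stds within the limits
            blocks.append((y, x, label, stds.max()))
    return blocks

def select_patches(img, n):
    """Pick n patches per image, giving priority to homogeneous regions."""
    blocks = classify_blocks(img)
    homog = [b for b in blocks if b[2] == 'homogeneous']
    rest = sorted([b for b in blocks if b[2] != 'homogeneous'],
                  key=lambda b: b[3])       # lowest standard deviation first
    if len(homog) >= n:                     # spread the picks evenly over the image
        chosen = [homog[i] for i in np.linspace(0, len(homog) - 1, n).astype(int)]
    else:
        chosen = homog + rest[:n - len(homog)]
    patches = [img[y:y + BLOCK, x:x + BLOCK] for y, x, _, _ in chosen]
    # subtract the per-colour-channel mean (computed here per patch) to
    # suppress brightness and colour cues
    return [p - p.mean(axis=(0, 1), keepdims=True) for p in patches]
```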
3.3. Patch classification

Next, we describe the ConvNet architecture used for patch classification. Each patch is classified into its source camera brand or source camera model by a ConvNet feature extractor whose output feeds into a fully connected neural network for classification. The architecture is inspired by MISLNet, which was proposed by Bayar and Stamm (2018a); the overall design is shown in Fig. 8. The architecture is divided into seven blocks. The dimensions of the input layer, 128 × 128 × 3, correspond to the patch size used in our experiments. The input layer is followed by four convolutional blocks and three fully connected blocks. Bayar and Stamm (2018a) used a constrained convolutional layer in the design of MISLNet to suppress high-level scene details; as our approach relies on homogeneous regions, we skip the constrained convolutional layer. Note also that we make use of all three input colour channels (RGB) to extract richer forensic traces, instead of relying on monochrome inputs as done by Bayar and Stamm (2018a). As shown in Fig. 8, the first block consists of convolutional filters of size 7 × 7 × 3, configured with a stride of 2 × 2 and no zero-padding ('valid' padding).
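For concreteness, the following PyTorch sketch reflects the block structure described here and in the next paragraph. The layer pattern (convolution, batch normalization, ReLU, and 2 × 2 max-pooling in blocks 1–4; three fully connected layers with dropout 0.3 in blocks 5–7) follows the text, while the filter counts, the kernel sizes of blocks 2–4, and the hidden layer widths are placeholder assumptions, since the exact values are given in Fig. 8 and the released code.

```python
import torch
import torch.nn as nn

class PatchConvNet(nn.Module):
    """Sketch of the patch classifier (cf. Fig. 8); block shapes are assumed."""

    def __init__(self, num_classes):
        super().__init__()

        def block(c_in, c_out, k, s=1):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, stride=s),  # no padding ('valid')
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),                      # halves each spatial dim
            )

        self.features = nn.Sequential(
            block(3, 96, 7, s=2),   # block 1: 7x7x3 filters, stride 2
            block(96, 64, 5),       # blocks 2-4: assumed filter counts/kernels
            block(64, 64, 5),
            block(64, 128, 1),
        )
        # infer the flattened feature size with a dummy forward pass
        with torch.no_grad():
            n_feat = self.features(torch.zeros(1, 3, 128, 128)).flatten(1).shape[1]
        self.classifier = nn.Sequential(              # blocks 5-7
            nn.Linear(n_feat, 200), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(200, 200), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(200, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```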
The convolutional layer is followed by a batch normalization layer (Ioffe & Szegedy, 2015), which in turn is followed by a Rectified Linear Unit (ReLU) activation that introduces non-linearity. Unlike MISLNet, we use ReLU activations for the convolutional layers instead of tanh activations. The activation layer is followed by a max-pooling layer of size 2 × 2, which halves each spatial dimension. We follow a similar pattern for blocks 1–4, where each block consists of a convolutional layer, batch normalization, and a ReLU transformation followed by a max-pooling layer; the exact details of these blocks are given in Fig. 8. In the proposed ConvNet, blocks 1–4 constitute the feature extraction part of the network, while blocks 5–7 represent the classifier. The output of the feature extractor is flattened to obtain a feature vector, which a fully connected neural network with three layers classifies into the desired camera class. In the case of brand classification, the output of the classifier is the camera brand (for example Sony, Canon, Nikon, and so on), while in the case of model classification the number of units in the output layer is set to match the number of camera models (for example Nikon_CoolPixS710, Nikon_D200, or Nikon_D70, among others). In order to avoid overfitting and achieve regularization, we use a dropout layer with a dropout factor of 0.3. The learning parameters, loss function, and data set used for the experiments are described in Section 4.

3.4. Majority voting

An image may contain many homogeneous regions, and, following the patch selection scheme, numerous homogeneous patches are extracted from it. For each patch of a given image, we apply the concerned ConvNet, which yields a label. All patch-level predictions are then combined by taking a majority vote to determine the label of the given test image. As demonstrated in Section 4, this majority voting significantly decreases the error rate of image-level predictions in comparison to patch-level predictions.

3.5. Hierarchical classification scheme

To classify an input image into its source camera model, we propose a two-level hierarchical classification approach, as shown in Fig. 1. The first level is concerned with brand classification, while the second performs model classification. The classifiers at the brand and model levels are trained independently of each other; the trained classifiers are then used in a hierarchical fashion to perform predictions on images. In order to predict the source camera model of a single image, homogeneous patches are first extracted. Secondly, the brand-level classifier determines the brand for each homogeneous patch. If the predicted brand corresponds to only one camera model, the brand classification trivially determines the model-level classification. On the other hand, if the predicted brand corresponds to multiple camera models, then the corresponding second-level classifier is used to determine the class labels of the concerned patches. This step is followed by a majority vote to determine the predicted camera model for the given image. The hierarchical scheme takes advantage of training with fewer camera classes per classifier: since each classifier in the hierarchy accounts for only a limited set of camera models, the training time of each classifier is reduced considerably, and all classifiers can be trained independently and in parallel.
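The following sketch illustrates this two-level prediction. Here, brand_net and model_nets[brand] are assumed to be trained patch classifiers (e.g. the ConvNet of Fig. 8), and models_per_brand is a hypothetical mapping from a brand label to the list of its camera-model labels.

```python
import torch
from collections import Counter

def predict_patches(net, patches):
    """One predicted label per patch; patches is an N x 3 x 128 x 128 tensor."""
    net.eval()                          # disable dropout / use BN running stats
    with torch.no_grad():
        return net(patches).argmax(dim=1).tolist()

def predict_image(patches, brand_net, model_nets, models_per_brand):
    # Level 1: majority vote over the patch-level brand predictions.
    brand = Counter(predict_patches(brand_net, patches)).most_common(1)[0][0]
    # Trivial case: the predicted brand has a single camera model.
    if len(models_per_brand[brand]) == 1:
        return models_per_brand[brand][0]
    # Level 2: the brand-specific model classifier, again with a majority vote.
    return Counter(predict_patches(model_nets[brand], patches)).most_common(1)[0][0]
```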
Moreover, with this modular approach, issues with any brand-specific classifier do not impact the other brand-specific classifiers, and the classification scheme can be extended to include additional camera devices without having to retrain all classifiers.
Table 1. Camera models of the 'natural' subset of the Dresden data set used in our experiments.

Sr. no. | Camera model name | # Devices | Total # images
1 | Canon_Ixus70 | 3 | 522
2 | Casio_EX-Z150 | 5 | 850
3 | FujiFilm_FinePixJ50 | 3 | 630
4 | Kodak_M1063 | 5 | 2314
5 | Nikon_CoolPixS710 | 5 | 846
6 | Nikon_D200 | 2 | 673
7 | Nikon_D70 | 4 | 676
8 | Olympus_mju_1050SW | 5 | 965
9 | Panasonic_DMC-FZ50 | 3 | 931
10 | Pentax_OptioA40 | 4 | 638
11 | Praktica_DCZ5.9 | 5 | 942
12 | Ricoh_GX100 | 5 | 854
13 | Rollei_RCP-7325XS | 3 | 544
14 | Samsung_L74wide | 3 | 641
15 | Samsung_NV15 | 3 | 599
16 | Sony_DSC-H50 | 2 | 541
17 | Sony_DSC-T77 | 4 | 906
18 | Sony_DSC-W170 | 2 | 405
4. Experiments
This section describes the data set used, along with the experiments performed to train and test the proposed hierarchical approach.

4.1. Data set

To conduct our experiments we used the publicly available benchmark Dresden data set (Gloe & Böhme, 2010). It consists of more than 14,000 images captured by numerous camera devices, covering a wide range of camera models and major camera brands. The images are divided into several subsets, namely 'JPEG', 'natural', 'flat-field', and 'dark-field', among others. We conducted our experiments on the 'natural' subset, which contains images of many different indoor and outdoor scenes. This high number of diverse scenes makes the 'natural' subset the most challenging and realistic one, unlike the other subsets, which contain images of at most two different scenes. The methods that we compare our results with also use this same 'natural' subset of Dresden. Several camera models are represented by a single device. In order to perform a fair evaluation of camera model identification, we decided to consider only camera models with at least two device instances, which allows us to keep one device aside for testing while using the rest for training. We therefore conduct our experiments on the remaining 66 devices, which together represent the 18 camera models listed in Table 1. As suggested by Kirchner and Gloe (2015), we combined the models Nikon_D70 and Nikon_D70s into a single camera model (Nikon_D70), as they correspond to the same model with a different lens. The natural images can be further divided into sub-categories based on the scene content. In order to perform a reliable evaluation, Bondi et al. (2016) and Rafi, Wu, and Hasan (2020) used a subset of scenes for testing, while the rest were used for training and validation; this ensured that the model's accuracy is scene independent. In our experiments we do not need this setting, as the selection of homogeneous patches ensures that scene-specific details are omitted.

4.2. Data balancing

In order to train a robust machine learning model, it is necessary to handle data imbalance appropriately during the training phase. This is essential for our study, as the number of images per camera model varies significantly between classes; we refer the reader to Table 1 for the exact image distribution per camera model. Traditional data balancing techniques, such as over-sampling and under-sampling of images, may not result in a representative data set. More specifically, over-sampling causes repetition in the training data, which can lead to overfitting, while random under-sampling can discard potentially useful data. We therefore propose a systematic top-down scheme for patch balancing. Consider the data imbalance problem for brand classification. Let \(N\) denote the total number of image patches that we would like to train on; this value is an estimate that drives the rest of the patch balancing algorithm, and the goal is to distribute these patches evenly across the training data set. Suppose that brand classification is a \(B\)-class classification problem, where \(B\) is the total number of camera brands. Without loss of generality, let \(N_b\) denote the number of patches to be sampled from a specific brand \(b\). In order to construct a balanced data set, \(N_b\) is determined as
\[
N_b = \mathrm{round}(N / B), \quad b \in \{1, \ldots, B\}, \tag{1}
\]
where round denotes the standard rounding function in Python (see footnote 2). Let \(M_b\) denote the number of models of brand \(b\), and let \(N_{b,m}\) denote the number of patches to be sampled from a particular model \(m\). In order to keep the number of patches the same across different models of the same brand, \(N_{b,m}\) is set to
\[
N_{b,m} = \mathrm{round}(N_b / M_b), \quad m \in \{1, \ldots, M_b\}. \tag{2}
\]
We continue this process to determine the number of patches \(N_{b,m,d}\) to be sampled at the device level, for all \(D_{b,m}\) devices of model \(m\):
\[
N_{b,m,d} = \mathrm{round}(N_{b,m} / D_{b,m}), \quad d \in \{1, \ldots, D_{b,m}\}. \tag{3}
\]
Finally, we determine the number of patches \(N_{b,m,d,i}\) to be sampled from each image \(i\) captured by device \(d\) as
\[
N_{b,m,d,i} = \mathrm{round}(N_{b,m,d} / I_{b,m,d}), \tag{4}
\]
where \(I_{b,m,d}\) is the number of images in the data set belonging to device \(d\). Thus, for an initial estimate of \(N\), the number of patches to extract from an image is given by \(N_{b,m,d,i}\), where \(d\), \(m\), and \(b\) correspond to the device, model, and brand of that image, respectively. These patches are chosen following the patch selection algorithm presented in Section 3.2. The proposed scheme ensures that the patches are evenly distributed at all levels of the hierarchy; Fig. 9 illustrates the patch balancing algorithm. Depending on the choice of \(N\), there may be very minor differences in the number of patches between classes, caused by the rounding function. For our setting, this minor difference in class distribution is acceptable and is not considered a data imbalance.
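A minimal sketch of the top-down balancing of Eqs. (1)–(4) is given below. The nested mapping counts (brand to model to device to number of images) is a hypothetical input structure; round is Python's built-in rounding function mentioned in footnote 2.

```python
def patch_quota_per_image(counts, n_total):
    """Return {(brand, model, device): number of patches to extract per image}."""
    quota = {}
    n_brand = round(n_total / len(counts))                    # Eq. (1)
    for brand, models in counts.items():
        n_model = round(n_brand / len(models))                # Eq. (2)
        for model, devices in models.items():
            n_device = round(n_model / len(devices))          # Eq. (3)
            for device, n_images in devices.items():
                quota[(brand, model, device)] = round(n_device / n_images)  # Eq. (4)
    return quota

# Example: with 13 brands and n_total = 260,000, each brand receives 20,000
# patches, which are then split evenly over its models, devices, and finally
# over the images of each device.
```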
4.3. Data set split

In line with Kirchner and Gloe (2015), we perform a 5-fold cross-validation in which we leave out one device per model for testing and use the remaining devices for training. For each fold, the test device is determined in a round-robin fashion. As shown in Table 1, the number of camera devices per model is not the same: the maximum number of devices per model is 5, and for such models each device appears exactly once in the test set across the folds. For models with fewer devices, we repeat devices in the test set following the same round-robin approach across the folds. This cross-validation strategy accounts for the data set bias and provides reliable results. We report the results for all folds, along with the global average.

4.4. Brand classification

As a first step, we perform camera brand classification. The data set that we consider contains 13 camera brands. Firstly, we balance the patches by setting \(N = 260{,}000\), which gives us 20,000 patches per class. The training loss is determined by computing the categorical cross-entropy between the target and the predicted output of the network. Let \(f(x; w)\) represent the network with weights \(w\), input example \(x\), and corresponding softmax output \(\hat{y}\). The categorical cross-entropy is a multi-class logistic loss function defined as
\[
L(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c, \tag{5}
\]
\[
J(w) = \frac{1}{n} \sum_{j=1}^{n} L\bigl(y^{(j)}, \hat{y}^{(j)}\bigr), \tag{6}
\]
where \(L\) denotes the loss function, \(y\) denotes the one-hot encoded target vector (over \(C\) classes) for the input \(x\), and \(J(w)\) is the empirical loss over a mini-batch of size \(n\). In our experiments, we use mini-batches of a fixed size. We use stochastic gradient descent (SGD) to optimize the model parameters, with a learning rate of 0.1 and a momentum of 0.8; we were able to set such a high initial learning rate because of the batch normalization layers in the ConvNet. Furthermore, an exponential learning rate decay with a multiplicative factor of 0.9 is used to achieve model convergence, with the learning rate decayed at the end of every epoch. In order to avoid overfitting, we use l2-regularization by setting the weight decay factor to 0.0005. With these settings, the model was trained until convergence, and we perform early stopping once the loss converges and no longer fluctuates by more than 0.02 over consecutive epochs. Fig. 10 shows the convergence of the loss achieved with these hyper-parameters. The best epoch is then determined by the maximum validation accuracy of each fold and used during the evaluation phase of the brand classifier. The exact number of trainable parameters of the proposed ConvNet depends on the number of output classes.

Fig. 11 shows the confusion matrix for the classification of camera brands on the first fold; a summary of the remaining folds is reported in the first row of Table 2. Note that the results in Table 2 correspond to image-level classification accuracy, not patch-level accuracy. During the training phase, the training data set is balanced in a hierarchical fashion, as shown in Fig. 9. This is, however, not the case during model evaluation: for a given test image, a set of homogeneous patches is extracted, the trained model predicts a class label for each patch, and a majority vote determines the predicted class label of the image. Taking the majority vote on patch-level predictions decreased the average error rate of image-level predictions by about 48 percent in comparison to individual patch-level predictions (see Table 4). Since the test data is skewed between classes, in addition to accuracy we also report the macro F1 score, defined as the average of the class-wise F1 scores, where the F1 score of each class is
\[
F1 = \frac{2\,TP}{2\,TP + FP + FN}, \tag{7}
\]
with TP, FP, and FN standing for true positives, false positives, and false negatives, respectively. The accuracy values reported in our experiments are determined as
\[
\text{accuracy} = \frac{\text{number of correctly classified images}}{\text{total number of images}}. \tag{8}
\]
We achieved an average accuracy of 99.4 percent and an average macro F1 score of 0.992 for brand-level classification.
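The training configuration stated above can be sketched as follows. PatchConvNet refers to the hypothetical module sketched in Section 3.3, and loader is assumed to yield mini-batches of (patches, labels); the epoch budget and the early-stopping logic are simplified here.

```python
import torch
import torch.nn as nn

def train(net, loader, epochs=100):
    """Train with the stated hyper-parameters; early stopping is omitted."""
    criterion = nn.CrossEntropyLoss()     # categorical cross-entropy, Eqs. (5)-(6)
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.8,
                                weight_decay=0.0005)  # l2-regularization
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
    for _ in range(epochs):               # upper bound; early stopping applies
        for patches, labels in loader:
            optimizer.zero_grad()
            loss = criterion(net(patches), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()                  # decay the learning rate every epoch
    return net

# Usage sketch: net = train(PatchConvNet(num_classes=13), loader)
```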
4.5. Model classification

Having trained and tested the brand classifier, we now discuss the model-level classifiers. We train model-level classifiers only for those camera brands that have multiple camera models; of the 13 camera brands, only three (Nikon, Samsung, and Sony) do. We use the same ConvNet architecture for the model-level classifiers as described in Section 4.4. In order to create a balanced training data set for model classification, we determine the number of patches to be extracted from each image using Eq. (4) with \(B = 1\), since each model-level classifier deals with a single brand. Therefore, for model-level classification, the number of patches to be extracted from each image of device \(d\) of model \(m\) is determined by
\[
N_{m,d,i} = \mathrm{round}(N_{m,d} / I_{m,d}), \quad \text{with } N_m = \mathrm{round}(N / M), \; N_{m,d} = \mathrm{round}(N_m / D_m). \tag{9}
\]
For the training of the Nikon, Samsung, and Sony classifiers, we choose \(N = 60{,}000\), 40,000, and 60,000 patches, respectively, with the idea of extracting 20,000 patches per camera model. Based on these values of \(N\), the data sets are prepared for training the respective classifiers. We continue to perform 5-fold cross-validation and ensure that the distribution of camera devices remains consistent between the folds of the brand-level and model-level classification. Moreover, the model-level and brand-level classifiers can be trained in parallel, as there is no dependency between them during the training phase. Using the same hyperparameters as for the brand classifier, we train each model-level classifier until convergence, and the model-level classifiers that achieve the highest validation accuracy are chosen as the final ones. Fig. 12 shows the confusion matrix of each model-level classifier, generated on the test data of the first fold. The Nikon classifier achieves an average accuracy of 99.7 percent across all folds; similarly, the Samsung and Sony classifiers achieve average accuracies of 99.8 and 97.5 percent, respectively. As in the case of brand classification, taking a majority vote on patch-level predictions to determine the image-level prediction reduces the error rate by almost 60 percent; complete details regarding the reduction in error rates are given in Table 4. Finally, we determine the macro F1 scores to account for the imbalance in the test data: the Nikon, Samsung, and Sony classifiers achieve average macro F1 scores of 0.997, 0.998, and 0.976, respectively (see Table 3 for complete details). The end-to-end hierarchical evaluation pipeline is described in the following sub-section.
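For reference, the reported metrics of Eqs. (7) and (8) can be computed from per-image true and predicted labels as in the following sketch.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Average of the class-wise F1 scores, Eq. (7)."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        scores.append(2 * tp / (2 * tp + fp + fn))  # F1 of class c
    return np.mean(scores)                          # macro average

def accuracy(y_true, y_pred):
    """Fraction of correctly classified images, Eq. (8)."""
    return np.mean(y_true == y_pred)
```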
4.6. Hierarchical classification

Having trained the brand-level and model-level classifiers, we now use them in a hierarchical fashion to predict the source camera model of a given image. We begin by extracting homogeneous patches from the image. At the top of the hierarchy, the brand-level classifier predicts the brand of each patch, and a majority vote determines the source camera brand. In the trivial case, when there is only one camera model for the predicted brand, the brand classification directly determines the model classification. Otherwise, the corresponding model-level classifier is used to determine the source camera model; in our case, we trained three such classifiers, one each for the Nikon, Samsung, and Sony brands. A majority vote over all patch predictions then determines the source camera model of the given image. The first-fold results of the hierarchical evaluation are presented as a confusion matrix in Fig. 13. We achieve an overall classification accuracy of 99.01 percent and an overall macro F1 score of 0.990 across all five folds. Table 2 and Table 3 summarize the accuracy and the macro F1 score achieved in each fold, respectively.

In order to compare our method with the state-of-the-art, we choose methods that have been evaluated on the same 'natural' subset of the Dresden data set; the summary of this comparison is presented in Table 5. Similar to the works of Bondi et al. (2016), Rafi, Tonmoy, et al. (2020), and Rafi, Wu, and Hasan (2020), we follow the leave-one-device-out strategy for cross-validation and show the results in Table 5. Our strategy of homogeneous patch selection ignores the scene details of an image, which allows us to avoid inadvertent classification of scene content; it was therefore not necessary to employ the scene-independent test set strategy proposed by Bondi et al. (2016). Marra et al. (2017) performed experiments on 25 camera models, of which 7 models have only one device in the Dresden data set; in their setting, it is not possible to test the generalizability of the trained classifier for those devices.
Among all the methods evaluated on the 18 camera models of the Dresden data set, we achieve the best classification accuracy and reduce the classification error rate by 46.49 percent compared to the previous best result, achieved by Rafi, Wu, and Hasan (2020).
Table 2. Image-level classification accuracy per fold.

Classification | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | Average
Brands | 0.995 | 0.992 | 0.994 | 0.994 | 0.995 | 0.994
Nikon | 0.997 | 0.997 | 0.999 | 0.997 | 0.997 | 0.997
Samsung | 1.000 | 0.995 | 0.998 | 1.000 | 0.995 | 0.998
Sony | 0.970 | 0.989 | 0.965 | 0.980 | 0.972 | 0.975
Hierarchical | 0.991 | 0.991 | 0.989 | 0.990 | 0.989 | 0.990
Table 3. Image-level macro F1 scores per fold.

Classification | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | Average
Brands | 0.993 | 0.992 | 0.992 | 0.992 | 0.993 | 0.992
Nikon | 0.997 | 0.997 | 0.999 | 0.997 | 0.997 | 0.997
Samsung | 1.000 | 0.995 | 0.998 | 1.000 | 0.995 | 0.998
Sony | 0.971 | 0.989 | 0.967 | 0.981 | 0.973 | 0.976
Hierarchical | 0.989 | 0.991 | 0.988 | 0.990 | 0.990 | 0.990

Table 4. Reduction (in percent) of the image-level error rate obtained by majority voting over patch-level predictions, per fold.

Classification | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | Average (%)
Brands | 49.2 | 40.8 | 47.1 | 50.9 | 52.7 | 48.1
Nikon | 30.3 | 77.1 | 65.7 | 70.2 | 35.2 | 55.7
Samsung | 100 | 29.4 | 25.5 | 100 | 31.1 | 57.2
Sony | 68.0 | 88.5 | 66.6 | 79.5 | 69.4 | 74.4

Table 5. Comparison with the state-of-the-art on the 'natural' subset of the Dresden data set.

Method | Accuracy | No. of models | Evaluation scheme
Tuama et al. (2016) | 0.9709 | 14 | 5-fold CV
Marra et al. (2017) | 0.9627 | 25 | 20-fold CV with 7 devices used for both train and test sets
Marra et al. (2017) | 0.9872 | 25 | 20-fold CV with 7 devices used for both train and test sets
Bondi et al. (2016) | - | 18 | Scene-independent test set
Rafi, Tonmoy, et al. (2020) | 0.9703 | 18 | Scene-independent test set
Rafi, Wu, and Hasan (2020) | 0.9815 | 18 | Scene-independent test set
Ours (fewer test patches) | - | 18 | Leave-one-device-out 5-fold CV
Ours (recommended test patches) | 0.9901 | 18 | Leave-one-device-out 5-fold CV
5. Discussion
This section describes further experiments that were conducted to support our hypotheses for SCI.

5.1. Number of patches

The results reported thus far were obtained with a fixed number of homogeneous patches per image during the test phase. In order to understand how many patches are required for evaluation, we conducted a series of experiments, summarized in Fig. 14. As can be seen, certain brands can work with few patches, while others require more patches to obtain high accuracy. This behaviour is evident for the Sony classifier, whose performance improves significantly with an increasing number of patches. Interestingly, the performance of the Samsung classifier decreases slightly at one of the larger patch counts; this minor deviation from the expected trend is, however, so small that it is probably due to chance. In general, classification becomes progressively more robust with an increasing number of randomly selected homogeneous patches, up to a certain point: extracting too many patches leads us into the regime of non-homogeneous patches. Therefore, for the Dresden data set, we recommend performing the prediction with the largest patch count that still draws only on homogeneous regions. Based on the homogeneous patch distribution shown in Fig. 7, most of the images (from all camera models) contain at least this recommended number of homogeneous patches of size 128 × 128 pixels. In fact, in Fig. 14 one can observe that the classification accuracy of the Sony classifier decreases at the largest evaluated patch count, which may be due to the inclusion of non-homogeneous patches (Fig. 7). As the Dresden data set contains images of diverse scenes, it is reasonable to assume that most natural images contain a comparable number of homogeneous patches. It is also important to account for the time taken to extract such patches. In our experiments, the extraction of patches from an image of size 3072 × 2304 pixels with patches of size 128 × 128 pixels took approximately 2.16 s, measured on a single core of an Intel Xeon E5-2680 v3 CPU (2.5 GHz). Because we use the integral image for patch selection and extraction, the execution time primarily depends on the image size and not on the number of homogeneous patches.
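The integral-image trick referred to above can be sketched as follows: cumulative sums of the pixel values and of their squares give the sum over any block in constant time, from which each block's variance follows as E[x^2] - (E[x])^2. The cost is therefore governed by the image size rather than by the number of blocks tested; the function below is a minimal single-channel illustration, not our exact implementation.

```python
import numpy as np

def block_stds(channel, block=128, stride=32):
    """Standard deviation of every block x block window of a single channel."""
    c = channel.astype(np.float64)
    s1 = np.pad(c, ((1, 0), (1, 0))).cumsum(0).cumsum(1)        # integral of x
    s2 = np.pad(c ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)   # integral of x^2
    n = block * block

    def box(s, y, x):  # sum of the source values inside the window at (y, x)
        return (s[y + block, x + block] - s[y, x + block]
                - s[y + block, x] + s[y, x])

    stds = []
    for y in range(0, c.shape[0] - block + 1, stride):
        row = []
        for x in range(0, c.shape[1] - block + 1, stride):
            mean = box(s1, y, x) / n
            row.append(np.sqrt(max(box(s2, y, x) / n - mean ** 2, 0.0)))
        stds.append(row)
    return np.array(stds)
```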
5.2. Hierarchical vs flat approach

One of our hypotheses was that a hierarchical approach is better than using a single classifier (flat approach). In order to evaluate this hypothesis, we conducted experiments in which all camera models were trained with a single classifier. For a fair comparison, the ConvNet parameters were kept unchanged and set to match our earlier experiments. We evaluated the flat ConvNet using homogeneous patches and performed a 5-fold cross-validation. The results of these experiments are summarized in Table 6. We obtained an average classification accuracy of 98.5 percent: in comparison to the hierarchical approach, the average error rate increases from about 1.0 to about 1.5 percent, that is, by roughly 45 percent. The flat classifier also took more epochs to converge than its hierarchical counterparts. This difficulty in learning is due to the mixing of different camera models and brands in a single classifier. In the hierarchical approach, the brand classifier accounts for the inter-brand noise variation, while the model-level classifiers account for the intra-brand noise variations. This separation of noise variations allows the hierarchical classifiers to learn better features for classification.

Notably, the results obtained by the flat approach are still better than those of existing works (Table 5). The proposed hierarchical approach has further advantages. It trivializes parallelization, as each classifier in the hierarchy can be trained independently of the others. Moreover, it is robust to the addition of new camera models: only the affected model-level classifier needs to be retrained, without having to update the other classifiers. This saves computation time both during the initial training and during future updates. Furthermore, certain brands achieve high accuracy with fewer patches. For instance, referring to Fig. 14, a small number of patches per image suffices to determine the camera brand with around 99.5 percent accuracy, and the same holds for the Nikon and Samsung classifiers, whereas the Sony classifier requires more patches to reach its highest classification accuracy. One may exploit this result to use the optimal number of patches per classifier and thereby also maximize the speed of the evaluation process, something that would not be possible with a single-classifier approach.

5.3. Future work

One direction for future research is the evaluation of the proposed approach under varying levels of image quality. Such an evaluation may use no-reference image quality measures (Gu et al., 2017, Gu et al., 2016, Mittal et al., 2012), which can determine the visual quality of images, among others (Gu, Li, et al., 2017, Gu, Zhou, et al., 2017). Using this information, the relationship between the input image quality and the resulting SCI performance can be determined; this could be used to assess whether a given image has sufficient quality for a reliable determination of its camera model. Another direction is to extend this work to device-level identification, which presents further challenges beyond the model-level identification at hand. Such an approach would be particularly useful when LEAs need to discriminate between two devices of the same camera model.
Table 6. Image-level classification accuracy per fold: flat vs hierarchical approach.

Classification | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | Average
Flat | 0.986 | 0.987 | 0.988 | 0.979 | 0.986 | 0.985
Hierarchical | 0.990 | 0.991 | 0.989 | 0.990 | 0.989 | 0.990
6. Conclusion
We propose a new approach for camera model identification that leverages the homogeneous regions of given images, which are least distorted by the scene content, for the reliable extraction of forensic traces. We showed that when classifiers are trained on such input data in a hierarchical fashion, the result is computationally efficient, modular, and more effective than a flat (single-classifier) approach. The modular design allows the addition of other camera brands without having to retrain the model classifiers of the already known brands. The accuracy of 99.01% that we achieve is the best reported so far for the 'natural' subset of the benchmark Dresden data set with 18 camera models.
CRediT authorship contribution statement
Guru Swaroop Bennabhaktula: Methodology, Validation, Software, Investigation, Resources, Writing – original draft, Writing – review & editing. Enrique Alegre: Conceptualization, Methodology, Writing – review & editing, Validation, Investigation, Supervision, Project administration. Dimka Karastoyanova: Writing – review & editing. George Azzopardi: Conceptualization, Methodology, Writing – original draft, Validation, Investigation, Writing – review & editing, Supervision, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute), Spain under Addendum 01. This research has been partly funded with support from the European Commission under the 4NSEEK project with Grant Agreement 821966. This publication reflects the views only of the authors, and the European Commission cannot be held responsible for any use which may be made of the information contained therein. We thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high-performance computing cluster.
References
- Barni M., Bondi L., Bonettini N., Bestagini P., Costanzo A., Maggini M., Tondi B., Tubaro S. Aligned and non-aligned double JPEG detection using convolutional neural networks. Journal of Visual Communication and Image Representation, 49 (2017), pp. 153-163.
- Barni M., Chen Z., Tondi B. Adversary-aware, data-driven detection of double JPEG compression: How to make counter-forensics harder. In 2016 IEEE International Workshop on Information Forensics and Security, IEEE (2016), pp. 1-6.
- Bayar B., Stamm M. C. A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (2016), pp. 5-10.
- Bayar B., Stamm M. C. Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security, 13 (11) (2018), pp. 2691-2706.
- Bayar B., Stamm M. C. Towards open set camera model identification using a deep learning framework. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (2018), pp. 2007-2011.
- Bayram S., Sencar H., Memon N., Avcibas I. Source camera identification based on CFA interpolation. In IEEE International Conference on Image Processing 2005 (vol. 3), IEEE (2005), pp. III-69.
- Bennabhaktula G. S., Alegre E., Karastoyanova D., Azzopardi G. Device-based image matching with similarity learning by convolutional neural networks that exploit the underlying camera sensor pattern noise. In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (vol. 1) (2020), pp. 578-584. doi:10.5220/0009155505780584.
- Bondi L., Baroffio L., Güera D., Bestagini P., Delp E. J., Tubaro S. First steps toward camera model identification with convolutional neural networks. IEEE Signal Processing Letters, 24 (3) (2016), pp. 259-263.
- Bondi L., Lameri S., Güera D., Bestagini P., Delp E. J., Tubaro S. Tampering detection and localization through clustering of camera-based CNN features. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE (2017), pp. 1855-1864.
- Bunk J., Bappy J. H., Mohammed T. M., Nataraj L., Flenner A., Manjunath B., Chandrasekaran S., Roy-Chowdhury A. K., Peterson L. Detection and localization of image forgeries using resampling features and deep learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE (2017), pp. 1881-1889.
- Cao H., Kot A. C. Accurate detection of demosaicing regularity for digital image forensics. IEEE Transactions on Information Forensics and Security, 4 (4) (2009), pp. 899-910.
- Chen M., Fridrich J., Goljan M., Lukás J. Determining image origin and integrity using sensor noise. IEEE Transactions on Information Forensics and Security, 3 (1) (2008), pp. 74-90.
- Chen J., Kang X., Liu Y., Wang Z. J. Median filtering forensics based on convolutional neural networks. IEEE Signal Processing Letters, 22 (11) (2015), pp. 1849-1853.
- Chen C., Stamm M. C. Camera model identification framework using an ensemble of demosaicing features. In 2015 IEEE International Workshop on Information Forensics and Security, IEEE (2015), pp. 1-6.
- Cozzolino D., Verdoliva L. Noiseprint: A CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security, 15 (2019), pp. 144-159.
- Fan W., Wang K., Cayre F. General-purpose image forensics using patch likelihood under image statistical models. In 2015 IEEE International Workshop on Information Forensics and Security, IEEE (2015), pp. 1-6.
- Gloe T., Böhme R. The 'Dresden Image Database' for benchmarking digital image forensics. In Proceedings of the 2010 ACM Symposium on Applied Computing (2010), pp. 1584-1590.
- Gu K., Jakhetiya V., Qiao J.-F., Li X., Lin W., Thalmann D. Model-based referenceless quality metric of 3D synthesized images using local image description. IEEE Transactions on Image Processing, 27 (1) (2017), pp. 394-405.
- Gu K., Li L., Lu H., Min X., Lin W. A fast reliable image quality predictor by fusing micro- and macro-structures. IEEE Transactions on Industrial Electronics, 64 (5) (2017), pp. 3903-3912.
- Gu K., Lin W., Zhai G., Yang X., Zhang W., Chen C. W. No-reference quality metric of contrast-distorted images based on information maximization. IEEE Transactions on Cybernetics, 47 (12) (2016), pp. 4559-4565.
- Gu K., Zhou J., Qiao J.-F., Zhai G., Lin W., Bovik A. C. No-reference quality assessment of screen content pictures. IEEE Transactions on Image Processing, 26 (8) (2017), pp. 4005-4018.
- Ioffe S., Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, PMLR (2015), pp. 448-456.
- Kang X., Stamm M. C., Peng A., Liu K. R. Robust median filtering forensics using an autoregressive model. IEEE Transactions on Information Forensics and Security, 8 (9) (2013), pp. 1456-1468.
- Kang X., Wei S. Identifying tampered regions using singular value decomposition in digital image forensics. In 2008 International Conference on Computer Science and Software Engineering (vol. 3), IEEE (2008), pp. 926-930.
- Kharrazi M., Sencar H. T., Memon N. Blind source camera identification. In 2004 International Conference on Image Processing (vol. 1), IEEE (2004), pp. 709-712.
- Kirchner M., Fridrich J. On detection of median filtering in digital images. In Media Forensics and Security II (vol. 7541), International Society for Optics and Photonics (2010), Article 754110.
- Kirchner M., Gloe T. Forensic camera model identification. In Handbook of Digital Forensics of Multimedia Data and Devices, Wiley Online Library (2015), pp. 329-374.
- Kurosawa K., Kuroki K., Saitoh N. CCD fingerprint method - identification of a video camera from videotaped images. In Proceedings 1999 International Conference on Image Processing (vol. 3), IEEE (1999), pp. 537-540.
- Li C.-T. Source camera identification using enhanced sensor pattern noise. IEEE Transactions on Information Forensics and Security, 5 (2) (2010), pp. 280-287.
- Li C.-T., Chang C.-Y., Li Y. On the repudiability of device identification and image integrity verification using sensor pattern noise. In International Conference on Information Security and Digital Forensics, Springer (2009), pp. 19-25.
- Li J., Li X., Yang B., Sun X. Segmentation-based image copy-move forgery detection scheme. IEEE Transactions on Information Forensics and Security, 10 (3) (2014), pp. 507-518.
- Lin X., Li C.-T. Preprocessing reference sensor pattern noise via spectrum equalization. IEEE Transactions on Information Forensics and Security, 11 (1) (2015), pp. 126-140.
- Lukas J., Fridrich J., Goljan M. Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security, 1 (2) (2006), pp. 205-214.
- Luo W., Huang J., Qiu G. JPEG error analysis and its applications to digital image forensics. IEEE Transactions on Information Forensics and Security, 5 (3) (2010), pp. 480-491.
- Marra F., Poggi G., Sansone C., Verdoliva L. A study of co-occurrence based local features for camera model identification. Multimedia Tools and Applications, 76 (4) (2017), pp. 4765-4781.
- Mayer O., Bayar B., Stamm M. C. Learning unified deep-features for multiple forensic tasks. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security (2018), pp. 79-84.
- Mittal A., Moorthy A. K., Bovik A. C. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21 (12) (2012), pp. 4695-4708.
- Rafi A. M., Tonmoy T. I., Kamal U., Wu Q. J., Hasan M. K. RemNet: Remnant convolutional neural network for camera model identification. Neural Computing and Applications (2020), pp. 1-16.
- Rafi A. M., Wu J., Hasan M. K. L2-constrained RemNet for camera model identification and image manipulation detection. In European Conference on Computer Vision, Springer (2020), pp. 267-282.
- Rosenfeld K., Sencar H. T. A study of the robustness of PRNU-based camera identification. In Media Forensics and Security (vol. 7254), International Society for Optics and Photonics (2009), Article 72540M.
- Swaminathan A., Wu M., Liu K. R. Nonintrusive component forensics of visual sensors using output images. IEEE Transactions on Information Forensics and Security, 2 (1) (2007), pp. 91-106.
- Tang H., Ni R., Zhao Y., Li X. Median filtering detection of small-size image based on CNN. Journal of Visual Communication and Image Representation, 51 (2018), pp. 162-168.
- Thai T. H., Cogranne R., Retraint F. Camera model identification based on the heteroscedastic noise model. IEEE Transactions on Image Processing, 23 (1) (2013), pp. 250-263.
- Timmerman D., Bennabhaktula G. S., Alegre E., Azzopardi G. Video camera identification from sensor pattern noise with a constrained ConvNet. In Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2021), SciTePress (2020).
- Tuama A., Comby F., Chaumont M. Camera model identification with the use of deep convolutional neural networks. In 2016 IEEE International Workshop on Information Forensics and Security, IEEE (2016), pp. 1-6.
- Wang X., Wang H., Niu S. An image forensic method for AI inpainting using faster R-CNN. In International Conference on Artificial Intelligence and Security, Springer (2019), pp. 476-487.
- Wang Q., Zhang R. Double JPEG compression forensics based on a convolutional neural network. EURASIP Journal on Information Security, 2016 (1) (2016), pp. 1-12.
- Xu G., Shi Y. Q. Camera model identification using local binary patterns. In 2012 IEEE International Conference on Multimedia and Expo, IEEE (2012), pp. 392-397.
- Zhu X., Qian Y., Zhao X., Sun B., Sun Y. A deep learning approach to patch-based image inpainting forensics. Signal Processing: Image Communication, 67 (2018), pp. 90-99.
Footnotes

1. https://github.com/bgswaroop/scd-images.
2. The built-in Python round function rounds half-integer values to the nearest even number.