Classifying Animals in the Wild

By William and Mait


We are working on a project to classify different animals from iWildCam images, as part of a Kaggle competition. Camera traps (or wild cams) enable the automatic collection of large quantities of image data. Biologists all over the world use camera traps to monitor biodiversity and population density of animal species. The purpose of our project is to automatically determine the species of animals present in each picture. The Kaggle link to the project is here [4].


The WCS training set contains 217,959 images from 441 locations, and the WCS test set contains 62,894 images from 111 locations. The data is highly imbalanced: 34% of train images are empty. Additional challenges are the large number of categories, that is, animal species present (216), and the fact that some pictures have multiple animals (for example different species at a watering hole), but only one label. Train and test locations are different, so the network cannot take a shortcut by learning locations (e.g. a certain tree is popular with squirrels), and must learn the animal shapes themselves. In addition, images are taken in different lighting conditions. A sample of training images is shown below:

Among the challenges associated with the images are poor lighting (as seen in some of the images above), motion blur from fast-moving animals, occlusion by vegetation, and occasional camera malfunctions. The neural network will need to work in spite of all these additional factors.


We trained neural network models using two architectures, EfficientNet B0 and ResNet-50. The implementation of these networks on the iWildCam dataset is described below. These two convolutional neural network architectures were chosen based on their prevalence in the relevant literature [5, 6]. As of 2021, the ResNet architecture is the most used architecture for image classification in relevant empirical papers, followed by the family of EfficientNet-based architectures (B0-B7).


Our ResNet-50 model was trained on only 15,000 images, to make the training process faster, enable us to try different methodologies, and stay within the limits of Google Colab/Google Drive capacity. Inevitably, using fewer images results in poorer classification accuracy than could have been obtained by using the whole dataset. We applied transformations to the training images to make the model more robust and less prone to overfitting. An example of the same image under different transformations is shown below:

The model was initialized using the cnn_learner() method, and images were converted into ImageDataBunch lists, which is the format the model accepts. We used the lr_find() method to find the optimal learning rate, the point where the gradient of the loss function shows the quickest drop; our optimal learning rate was 0.03. A total of 10 epochs were used in training the network: 5 for fine-tuning the weights of the head, and 5 for the rest of the body. In hindsight, using more epochs could have produced a better result by a few percentage points, and should certainly be done if more time is available.
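The idea behind lr_find(), picking the learning rate at which the loss drops fastest on a log scale, can be sketched in plain NumPy. This is a toy illustration with made-up loss values, not fastai's exact implementation:

```python
import numpy as np

def suggest_lr(lrs, losses):
    """Return the learning rate where loss falls fastest, i.e. where the
    gradient of the loss with respect to log(lr) is most negative."""
    grads = np.gradient(losses, np.log(lrs))
    return lrs[int(np.argmin(grads))]

# Made-up loss curve over a log-spaced LR sweep: flat at first, a steep
# drop around 1e-2, then divergence at high learning rates.
lrs = np.logspace(-5, 0, 6)
losses = np.array([2.30, 2.28, 2.25, 2.10, 1.20, 5.00])
best = suggest_lr(lrs, losses)   # 0.01 for this toy curve
```

For this toy curve the steepest drop sits at 1e-2, which is roughly how we arrived at our 0.03 rate from the real loss curve.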

Training and validation loss curves for ResNet-50

We used the default loss function for ResNet-50, cross-entropy loss. Notably, training loss is consistently higher than validation loss, which is unexpected. We believe this is due to the transformations applied to the training images (rotations, reflections, etc.): transformed training images are harder to classify than the untransformed validation images, which would explain why training loss is higher.

Predictions were made using Test Time Augmentation (TTA): each test image was transformed 3 times, and the final prediction was a majority vote across all 4 predictions (the original image and the 3 transformations). We found that this gave more accurate results than simple predictions. Somewhat surprisingly, making predictions was the most time-consuming task: total training time for the network was approximately 3 hours, while vanilla predictions took 7 hours and TTA took 13 hours to run.
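The majority-vote aggregation described above can be sketched as follows; predict_fn and the transforms here are hypothetical stand-ins for the trained model and the image augmentations:

```python
import numpy as np

def tta_predict(predict_fn, image, transforms):
    """Aggregate Test Time Augmentation predictions by majority vote:
    one prediction for the original image plus one per transformed copy."""
    votes = [predict_fn(image)] + [predict_fn(t(image)) for t in transforms]
    return int(np.argmax(np.bincount(votes)))  # most frequent class id wins

# Toy demo: a stand-in predictor that disagrees on one augmented copy.
preds = iter([3, 3, 7, 3])                     # original + 3 augmentations
predict_fn = lambda img: next(preds)
identity = lambda img: img
label = tta_predict(predict_fn, "image", [identity] * 3)  # vote: 3
```

Note that fastai's built-in TTA averages class probabilities rather than voting on labels; the sketch above follows the voting scheme we describe.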

Trial with a combination of a custom DCNN and EfficientNet B0 — sadly unsuccessful

From the initial implementations of ResNet-50 and EfficientNet B0-B7, the imbalance of categories in the training data emerged as a general challenge. Around 34% of the 217,959 training images are empty. Further, only 33 of the 216 classes have over 1,000 samples, and over half of the categories have fewer than 50 samples. As an additional problem, the Kaggle test set for the competition has 267 classes, which means the test set contains 51 classes that are not present in the training data.

To mitigate the class imbalance problem, we decided to try out a two-stage image classification pipeline.
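The dispatch logic of this two-stage pipeline can be sketched as below, with toy callables standing in for the two trained models described in the following sections:

```python
def classify(image, is_animal_present, classify_species, empty_label="empty"):
    """Two-stage pipeline: stage 1 filters out empty frames,
    stage 2 assigns a species only when an animal is present."""
    if not is_animal_present(image):
        return empty_label
    return classify_species(image)

# Toy stand-ins for the trained stage-1 DCNN and stage-2 EfficientNet B0.
stage1 = lambda img: img != "blank"
stage2 = lambda img: "ocellated_turkey"
empty_result = classify("blank", stage1, stage2)    # "empty"
animal_result = classify("turkey", stage1, stage2)  # "ocellated_turkey"
```

The design intent is that the binary stage absorbs the dominant "empty" class, so the species classifier trains and predicts only on the minority of frames that actually contain animals.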

  1. Custom DCNN to classify images as empty or containing an animal

We implemented a custom DCNN with 2 convolutional layers and two dense layers on top of the convolutional layers, to classify images by whether an animal is present or not. The general architecture of the network, with the final hyperparameters used, can be seen in the following Jupyter Notebook [1].
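An illustrative Keras reconstruction of such a 2-convolutional-layer, 2-dense-layer binary classifier is shown below; the filter counts, pooling, and dropout rate are assumptions, and the exact architecture is in the notebook referenced as [1]:

```python
from tensorflow.keras import layers, models

# Sketch of a small binary "animal present" classifier on 56x56 inputs.
model = models.Sequential([
    layers.Input(shape=(56, 56, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),    # P(animal present)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```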

This network was trained on 90,000 images downsized to 56 x 56 pixels. Training and test data were normalized, and the training data was augmented using TensorFlow's ImageDataGenerator class. Training and test data were generated with a 70%/30% split.
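A rough sketch of the ImageDataGenerator setup, using random arrays in place of the real 56 x 56 crops; the specific flip/rotation settings are illustrative choices, not necessarily the notebook's values:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random arrays stand in for 56x56 camera-trap crops and one-hot labels.
x = np.random.rand(10, 56, 56, 3).astype("float32") * 255.0
y = np.eye(2)[np.random.randint(0, 2, size=10)]

# rescale normalizes pixels to [0, 1]; validation_split gives 70%/30%.
datagen = ImageDataGenerator(rescale=1.0 / 255,
                             rotation_range=15,
                             horizontal_flip=True,
                             validation_split=0.3)

xb, yb = next(datagen.flow(x, y, batch_size=10, subset="training"))
```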

Hyperparameters (layer count, dropout rates, learning rate) were tuned on a smaller subsample of n = 10,000 images. Of note, the heterogeneity of the images meant that, in general, the set of hyperparameters found on the subsample did not generalize well to the training run with the full dataset, a problem that was especially pronounced in the implementation of the following EfficientNet B0 model. Therefore, the following regularization techniques were implemented: multiple batch normalization layers and multiple dropout layers. Relatedly, we used Keras callbacks to reduce the learning rate by a factor of 0.2, down to a minimum of 1e-6, whenever validation loss had not decreased for 2 epochs. To handle the plateau point, where training accuracy kept increasing between epochs while validation accuracy started to decrease, the model with the best validation accuracy was saved. An early stopping callback was implemented to terminate training early if validation loss stopped decreasing over four successive epochs.
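The callback setup described above could look roughly like this in Keras (the monitored metric names and the checkpoint filename are assumptions):

```python
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

# Reduce LR by a factor of 0.2 (down to 1e-6) when val loss stalls for 2 epochs.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                              patience=2, min_lr=1e-6)
# Stop training if val loss has not improved over 4 successive epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=4)
# Keep the model with the best validation accuracy seen so far.
checkpoint = ModelCheckpoint("best_model.keras", monitor="val_accuracy",
                             save_best_only=True)

callbacks = [reduce_lr, early_stop, checkpoint]
# model.fit(x, y, validation_split=0.3, callbacks=callbacks)
```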

Initial hyperparameter tuning suggested that larger batch sizes performed better, so the batch size was raised to the maximum the available hardware allowed (128 for the custom DCNN; 32 for the EfficientNet B0 implementations), found by trial and error until a CUDA out-of-memory error was encountered.

The Adam optimizer was used in conjunction with a binary cross-entropy loss function. Training reached early stopping at epoch 9, with the best validation score seen at epoch 8, where test and training accuracy were both at 91%. The training run is presented in Figure 1. Notebook [1] also includes a TensorBoard callback, which presents the test data interactively and in more detail.

Figure 1. Animal present or not classification with custom DCNN.

2. Fully trainable EfficientNet B0 to categorize animals in pictures where an animal is present

Using all the images from the training set in which animals are present (n = 143,742), we created a DCNN model with the EfficientNet B0 architecture. The EfficientNet B0 top layers were rebuilt by adding a global average pooling layer, a flattening layer, a 512-neuron dense layer (with L1L2 regularization and ReLU activation), and finally a regularizing dropout layer and a softmax output layer. The initial weights were loaded from ImageNet and all layers of the network were set as trainable, resulting in 4,773,715 trainable parameters. A 70%/30% train/test split, the Adam optimizer, and categorical cross-entropy loss were used.
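A sketch of the rebuilt top layers in Keras; weights=None keeps the example offline, whereas the report loads weights="imagenet", and the dropout rate here is an assumption:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.regularizers import l1_l2

# Backbone without its original classification head; all layers trainable.
base = EfficientNetB0(include_top=False, weights=None,
                      input_shape=(224, 224, 3))

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu", kernel_regularizer=l1_l2())(x)
x = layers.Dropout(0.3)(x)                    # dropout rate is an assumption
outputs = layers.Dense(216, activation="softmax")(x)  # 216 species classes

model = models.Model(base.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```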

As described previously for the custom DCNN, similar data augmentation techniques were used in preprocessing, except that images were resized to 224 x 224 pixels, as required by the EfficientNet B0 input. Similar hurdles emerged in optimizing hyperparameters as previously described for the custom DCNN, so a similar set of decisions was taken here as well: mainly, using early stopping and adaptive learning rate callbacks, and maximizing batch size to the available GPU memory budget.

The implementation of this network is presented in the Jupyter Notebook with reference [2]. Figure 2 shows the training results, which indicate a high level of overfitting. Training was terminated by the relevant callback due to a plateau in validation loss.

Figure 2. Trainable EfficientNet B0 model initialized with ImageNet weights and custom top layers.

We also implemented a second EfficientNet in which we froze the weights loaded from ImageNet, allowing training only for the top 10 layers. This network included global average pooling, batch normalization, a flattening layer, 2 hidden dense layers (n = 512; ReLU; L1L2 regularization), and a softmax output layer. This network is described in reference [3]. By and large, we aimed to implement transfer learning, in which a previously trained model is reused for a new task.
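A sketch of this frozen-backbone variant in Keras; again weights=None stands in for the ImageNet weights used in the report, and the exact choice of the top 10 layers follows the description above:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.regularizers import l1_l2

base = EfficientNetB0(include_top=False, weights=None,  # report: "imagenet"
                      input_shape=(224, 224, 3))
# Freeze everything except the top 10 layers of the backbone.
for layer in base.layers[:-10]:
    layer.trainable = False

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.BatchNormalization()(x)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu", kernel_regularizer=l1_l2())(x)
x = layers.Dense(512, activation="relu", kernel_regularizer=l1_l2())(x)
outputs = layers.Dense(216, activation="softmax")(x)
model = models.Model(base.input, outputs)
```

Freezing layer-by-layer (rather than setting base.trainable = False on the whole backbone) is what lets the last 10 layers keep their trainable weights.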

Figure 3 shows the training results, which indicate very low performance (around 11.6% accuracy on both the validation and test sets).

Figure 3. EfficientNet B0 model initialized with ImageNet weights and custom top layers, with fixed weights except for the top 10 layers.


An obvious way to improve accuracy would be to use more data when training the models, starting with the full Kaggle dataset and possibly the previous years' datasets, which are also suggested as potential training data. The main challenges would be slow training times and large memory requirements.


[1] — Animal present or not classification with custom DCNN:

[2] — Fully trainable EfficientNet B0 model, with custom top layers:

[3] — EfficientNet B0 model with restricted training schedule (top 10 layers trainable) and custom top layers:

[4] — Link to dataset:

[5] — EfficientNet architecture usage and general overview:

[6] — ResNet architecture usage and general overview: