Pneumonia Detection Using Chest X-Rays and Convolutional Neural Network (A Step-by-Step Guide) Part 1

Helen Lai
10 min read · Sep 7, 2020


I. Introduction

Given the current pandemic, it would be interesting to explore the ways in which data science could help fight the coronavirus. In this two-part post, we will help you become familiar with the general workflow of an image-processing project by providing a step-by-step guide to designing and building a Convolutional Neural Network (CNN) for pneumonia detection in PyTorch. To get the most out of this post, readers are recommended to first familiarise themselves with the basic concepts of neural networks and convolutional filters.

Before jumping into the main section, let’s first get a better idea of what pneumonia is and what distinguishing features it might have in an X-ray image.

Pneumonia is defined as a lung inflammation disease that can be caused by bacteria, viruses or fungi [1]. Each type may have a different manifestation in X-ray images, which makes the classification task more difficult. An example of the comparison between PNEUMONIA and NORMAL cases is presented below. As shown in the picture, due to liquid filling the lungs during the infection, there are usually hazy patches in the chest X-ray for PNEUMONIA cases. However, it is worth noting that these conclusions are drawn solely from visual inspection of a subset of the images. It is likely that there are other distinguishing features that require much more domain knowledge to identify.

Note that the code in this post makes use of a trained model and data that can be downloaded here.

Figure 1: Normal vs. Pneumonia [2].

II. Outline

To help you get a better understanding of how the workflow of this project is structured, let’s first go through the outline of the rest of the post.

In part one of the post, we will first gain a better understanding of the current dataset and identify the data augmentation techniques that are required for more robust predictions. Then we will move on to building the CNN architecture. The rationale for the associated hyperparameter choices will be explained along the way.

In the second part of the post, we will focus on the optimisation, where the discussion will primarily revolve around the choice of algorithm, learning rate and stopping criteria. Finally, we will present the results obtained from our current model and discuss their potential implications.

III. Dataset Description and Pre-processing

First things first: let’s import the packages and set up the CUDA GPU and the file paths to our training, test and validation datasets. The random seed is set at the outset to ensure reproducibility.

#os and random are needed below for the file paths and seeding;
#torch.nn.functional (F) is used later in the forward pass
import os
import random
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import numpy.linalg as lg
#If a CUDA enabled GPU is available, use it for the training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#File paths to the train, test and validation sets
training_set_path = os.path.join(os.getcwd(), "data_224_224_big_val_flip", "train")
test_set_path = os.path.join(os.getcwd(), "data_224_224_big_val_flip", "test")
validation_set_path = os.path.join(os.getcwd(), "data_224_224_big_val_flip", "val")
#Setting the seed for random number generation
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.backends.cudnn.deterministic = True
random.seed(0)

Now let’s have a closer look at our dataset. The entire dataset consists of a total of 5830 images, which are pre-sorted into training, validation, and testing sets with sizes of 5190, 16, and 624 respectively [2]. Since a validation set of size 16 may be too small to produce a generalisable result, we expanded its size to 100 by moving images from the training set. The input data are torch tensor representations of each image with normalised pixel intensities (ranging from 0 to 1) as features. Since the feature dimensions in the original dataset vary across images (e.g. 3x1469x993, 3x932x588), the images were resized to 3x224x224. All images are labelled as either NORMAL or PNEUMONIA. The proportions of the two classes are, however, imbalanced in all three sets with NORMAL to PNEUMONIA ratio shown in the table below.

Table 1 Class Ratio Prior to Data Augmentation
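
The resizing step mentioned above is not shown in the post itself. A minimal sketch of how it could be done with torchvision is given below; the file name is purely hypothetical and ToTensor handles the normalisation of pixel intensities to [0, 1].

from PIL import Image
import torchvision.transforms as transforms

#Minimal sketch (not the original pre-processing script): resize one X-ray to
#3 x 224 x 224 and normalise its pixel intensities to [0, 1]
resize_to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),  #force a fixed 224 x 224 spatial size
    transforms.ToTensor()           #convert to a CHW float tensor in [0, 1]
])
img = Image.open("some_xray.jpeg").convert("RGB")  #hypothetical file name
x = resize_to_tensor(img)
print(x.shape)  #torch.Size([3, 224, 224])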

To combat this class imbalance, we applied horizontal flipping to the NORMAL images. However, this was only done for the training set, since in practice we are likely to encounter class imbalance in the test set as well. Therefore, the validation and test sets are kept unchanged for better generalisability.

Figure 2 Effect of Horizontal Flipping
Table 2 Class Ratio After Horizontal Flipping

As shown in Table 2, we have doubled the number of NORMAL images and the class ratio is now a lot closer to 50:50. Note that we did not completely remove the imbalance: in medical diagnosis, a false negative is considered significantly more costly than a false positive, so a slight imbalance towards PNEUMONIA might help to improve our recall.
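
The flipping script itself is not included in the post. A rough sketch of the offline flip is shown below, assuming the ImageFolder-style layout implied by the paths used earlier (a NORMAL sub-folder inside train); the folder names are assumptions for illustration.

import os
from PIL import Image

#Rough sketch (assumed folder layout, not the original script): save a horizontally
#flipped copy of every NORMAL training image to reduce the class imbalance
normal_dir = os.path.join("data_224_224_big_val_flip", "train", "NORMAL")
for name in os.listdir(normal_dir):
    img = Image.open(os.path.join(normal_dir, name))
    flipped = img.transpose(Image.FLIP_LEFT_RIGHT)  #mirror the X-ray left to right
    flipped.save(os.path.join(normal_dir, "flipped_" + name))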

Although we have now addressed the class imbalance issue, the total size of our training set is still quite limited. This issue is especially important in the context of computer vision, as the field is notorious for being data-hungry, and model performance is often correlated with the size of the training data. Therefore, we apply two additional data augmentation strategies: 20-degree rotations and 120% rescaling. The 120% rescale was added on top of the rotation because, based on some experiments, this yielded better results than applying the two augmentation strategies separately. Unlike the horizontal flipping, these two strategies are applied with a 50% probability on the fly as we load our training data. This means we did not increase the actual size of the training set; instead, we are now sampling images from a much larger sample space.

To achieve this, we will first define a set of transformation routines using torchvision.transforms. Then we will let PyTorch know where to get our dataset and what transformations are required by passing the file path and the pre-defined transformations to ImageFolder from torchvision.datasets. Finally, our data loader is defined to generate images for a given batch size.

"""
Function that returns a data loader given a data path with some included parameters for shuffling and augmentation.

@param
path: path to the data set to be loaded
batch_size: size of each batch for the data loader
shuffle: whether or not to shuffle data during iteration
to_augment: whether or not to apply random transformations (rotation by 20 degrees / scaling by 1.2)

@return
torch DataLoader object that loads the data in the given path
"""
def get_loader(path, batch_size=16, shuffle=True, to_augment=True): if to_augment:
affine1=transforms.RandomAffine(degrees=20, translate=None, scale=None, shear=None, resample=False, fillcolor=0)
affine2=transforms.RandomAffine(degrees=20, translate=None, scale=(1.2,1), shear=None, resample=False, fillcolor=0)
random_affine=transforms.RandomApply([affine1,affine2],p=0.5)
transform=transforms.Compose([
random_affine,
transforms.ToTensor()])
else:
transform = transforms.Compose([transforms.ToTensor()])
data_folder = torchvision.datasets.ImageFolder(root=path,transform=transform)
data_loader = torch.utils.data.DataLoader(data_folder,batch_size=batch_size, shuffle=shuffle)
return data_loader
train_loader = get_loader(training_set_path)
test_loader = get_loader(test_set_path,batch_size=1,to_augment=False)
val_loader = get_loader(validation_set_path,to_augment=False)
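
As a quick sanity check (an addition to the original code), we can pull a single batch from the training loader and confirm the tensor shapes and labels:

#One training batch should contain 16 images of shape 3 x 224 x 224, with labels
#0/1 corresponding to the NORMAL/PNEUMONIA folders picked up by ImageFolder
images, labels = next(iter(train_loader))
print(images.shape)  #torch.Size([16, 3, 224, 224])
print(labels[:8])    #e.g. tensor([0, 1, 1, 0, 1, 0, 1, 1])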

IV. Network Architecture

Having got the data pre-processing steps out of the way, we can now finally build our neural network. The network presented here is a simplified version of the well-known VGG-16 [3]. Although the original VGG-16 network contains five convolutional blocks, the current model only includes four, as based on experiments, additional convolutional layers did not improve model performance. The overview of our model architecture is illustrated in figure 3. Let’s now go through our design process and what each green block actually means.

Figure 3 Network Architecture

Convolutional Layer

For each convolutional layer, there are four hyper-parameter choices: number of filters, filter size, stride, and padding size. For the first hyper-parameter, we follow the same pattern as VGG-16, where the model starts with 64 filters and this number gets doubled with each additional convolutional layer. To decide on our filter size, let’s consider the main feature that distinguishes NORMAL and PNEUMONIA, i.e. the hazy patches due to liquid in the lungs. Since these patches are relatively large in size, we set our filter size to be 5x5 to capture these high level representations. Regarding the third parameter, the stride is, by convention, set to be one for all convolutional layers.

In addition to these three parameter choices, we can also pad the input to each convolutional layer with zeros to prevent under-sampling at the edges [3]. The size of the padding is chosen such that the spatial dimensions of each image remain the same as before the application of the convolution. The exact value is computed using the formula below, where 𝑁, 𝑃, 𝐹, and 𝑆 represent the spatial dimension of the image, the padding size, the filter size and the stride respectively.

Formula to calculate the padding size: (𝑁 − 𝐹 + 2𝑃)/𝑆 + 1 = 𝑁, which for a stride of 𝑆 = 1 simplifies to 𝑃 = (𝐹 − 1)/2, i.e. 𝑃 = 2 for our 5x5 filters.

Here is an example of zero-padding with size 2.

Figure 4 Illustration of Zero Padding
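
We can verify this choice of padding directly in PyTorch. The small check below (an addition for illustration) confirms that a 5x5 filter with stride 1 and padding 2 preserves the 224x224 spatial dimensions:

import torch
import torch.nn as nn

#With kernel size 5, stride 1 and padding (5 - 1) / 2 = 2, the output spatial
#dimensions match the input: (224 - 5 + 2 * 2) / 1 + 1 = 224
conv = nn.Conv2d(3, 64, 5, stride=1, padding=2)
out = conv(torch.randn(1, 3, 224, 224))
print(out.shape)  #torch.Size([1, 64, 224, 224])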

Below is the code snippet for the PyTorch implementation of the aforementioned convolutional layers.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1_1 = nn.Conv2d(3,64,5,stride=1,padding=2)
        self.conv2_1 = nn.Conv2d(64,128,5,stride=1,padding=2)
        self.conv3_1 = nn.Conv2d(128,256,5,stride=1,padding=2)
        self.conv4_1 = nn.Conv2d(256,512,5,stride=1,padding=2)

Pooling Layer

Having defined our convolutional layers, we now consider another type of layer: max-pooling. It is applied to the output of the ReLU activation function to shrink the height and width of the image by retaining only the maximum value of each neighbourhood. This operation not only reduces the computational cost, but also results in greater spatial invariance, as the output is unaffected by small variations in the non-maximal pixels of a neighbourhood; a toy example follows the code block below. By convention, we set the kernel (neighbourhood) size of the first three pooling layers to 2x2 with a stride of two. Since we are working with limited vRAM, a larger shrinkage is applied in the last pooling layer using a bigger kernel of size 4x4, with a stride of 4.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1_1 = nn.Conv2d(3,64,5,stride=1,padding=2)
        self.conv2_1 = nn.Conv2d(64,128,5,stride=1,padding=2)
        self.conv3_1 = nn.Conv2d(128,256,5,stride=1,padding=2)
        self.conv4_1 = nn.Conv2d(256,512,5,stride=1,padding=2)
        self.pool = nn.MaxPool2d(2,2)
        self.big_pool = nn.MaxPool2d(4,4)
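
For intuition, here is a tiny toy example (an addition, not from the original post) of what 2x2 max-pooling with a stride of two does to a 4x4 input:

import torch
import torch.nn.functional as F

#Each non-overlapping 2x2 block is replaced by its maximum value, halving the
#spatial dimensions from 4x4 to 2x2
x = torch.tensor([[[[1., 2., 0., 1.],
                    [3., 4., 1., 0.],
                    [0., 1., 5., 6.],
                    [2., 0., 7., 8.]]]])
print(F.max_pool2d(x, kernel_size=2, stride=2))
#tensor([[[[4., 1.],
#          [2., 8.]]]])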

Fully Connected Layer

Similar to VGG-16, after the convolutional layers the flattened feature map is passed through three fully connected layers with ReLU activations. The flattened input has 7 x 7 x 512 = 25088 features, since the 224 x 224 input is halved by each of the first three pooling layers and reduced by a further factor of four by the last one (224 / 2 / 2 / 2 / 4 = 7). Considering the limited vRAM capacity, the current model uses fully connected layers that are half the size of those in the original VGG-16 (2048 vs. 4096).

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1_1 = nn.Conv2d(3,64,5,stride=1,padding=2)
        self.conv2_1 = nn.Conv2d(64,128,5,stride=1,padding=2)
        self.conv3_1 = nn.Conv2d(128,256,5,stride=1,padding=2)
        self.conv4_1 = nn.Conv2d(256,512,5,stride=1,padding=2)
        self.pool = nn.MaxPool2d(2,2)
        self.big_pool = nn.MaxPool2d(4,4)
        self.fc1 = nn.Linear(7*7*512, 2048)
        self.fc2 = nn.Linear(2048,2048)
        self.fc3 = nn.Linear(2048,2)

Dropout Layer

Furthermore, since the fully connected layers are especially prone to overfitting, a dropout layer is added in front of every linear projection. The basic idea of dropout is visualised in figure 5. During training, it zeroes out each input neuron with a certain probability p, which is set to 0.1 in our case. The surviving values are then scaled up by a factor of 1/(1-p) to keep the expectation approximately the same; a small demonstration follows the code block below.

Figure 5: Dropout Illustration [4]
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1_1 = nn.Conv2d(3,64,5,stride=1,padding=2)
        self.conv2_1 = nn.Conv2d(64,128,5,stride=1,padding=2)
        self.conv3_1 = nn.Conv2d(128,256,5,stride=1,padding=2)
        self.conv4_1 = nn.Conv2d(256,512,5,stride=1,padding=2)
        self.pool = nn.MaxPool2d(2,2)
        self.big_pool = nn.MaxPool2d(4,4)
        self.fc1 = nn.Linear(7*7*512, 2048)
        self.fc2 = nn.Linear(2048,2048)
        self.fc3 = nn.Linear(2048,2)
        self.dropout = nn.Dropout(0.1)
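
The zeroing and re-scaling behaviour can be checked directly. The following small demonstration (added for illustration) shows that in training mode the surviving activations are scaled by 1/(1 - 0.1) ≈ 1.11, while in evaluation mode the layer leaves its input untouched:

import torch
import torch.nn as nn

#In train mode roughly 10% of the elements are zeroed and the rest are scaled by
#1 / (1 - 0.1); in eval mode dropout is a no-op
drop = nn.Dropout(0.1)
x = torch.ones(10)
drop.train()
print(drop(x))  #mostly 1.1111 with the occasional 0
drop.eval()
print(drop(x))  #all ones, unchanged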

Now that we have incrementally initialised all the layers we need to build our final neural network, let’s arrange these layers in the order outlined in figure 3. This is achieved by defining the forward function, as shown below.

class Net(nn.Module):
    """
    Neural network class that the model is based on (see description of architecture above).
    """
    def __init__(self):
        super(Net, self).__init__()
        self.conv1_1 = nn.Conv2d(3,64,5,stride=1,padding=2)
        self.conv2_1 = nn.Conv2d(64,128,5,stride=1,padding=2)
        self.conv3_1 = nn.Conv2d(128,256,5,stride=1,padding=2)
        self.conv4_1 = nn.Conv2d(256,512,5,stride=1,padding=2)
        self.pool = nn.MaxPool2d(2,2)
        self.big_pool = nn.MaxPool2d(4,4)
        self.fc1 = nn.Linear(7*7*512, 2048)
        self.fc2 = nn.Linear(2048,2048)
        self.fc3 = nn.Linear(2048,2)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        """
        Forward pass function for the neural network
        """
        #1st convolutional layer
        #Input dimension: 3 x 224 x 224
        #Output dimension: 64 x 112 x 112
        x = self.conv1_1(x)
        x = self.pool(F.relu(x))
        #2nd convolutional layer
        #Input dimension: 64 x 112 x 112
        #Output dimension: 128 x 56 x 56
        x = self.conv2_1(x)
        x = self.pool(F.relu(x))
        #3rd convolutional layer
        #Input dimension: 128 x 56 x 56
        #Output dimension: 256 x 28 x 28
        x = self.conv3_1(x)
        x = self.pool(F.relu(x))
        #4th convolutional layer
        #Input dimension: 256 x 28 x 28
        #Output dimension: 512 x 7 x 7
        x = self.conv4_1(x)
        x = self.big_pool(F.relu(x))
        #Stretching out the image matrix into a vector
        x = x.view(-1, 7*7*512)
        #Adding 10% dropout
        x = self.dropout(x)
        #1st fully connected layer with ReLU activation
        #Input dimension: 25088
        #Output dimension: 2048
        x = F.relu(self.fc1(x))
        #Adding 10% dropout
        x = self.dropout(x)
        #2nd fully connected layer with ReLU activation
        #Input dimension: 2048
        #Output dimension: 2048
        x = F.relu(self.fc2(x))
        #Adding 10% dropout
        x = self.dropout(x)
        #3rd fully connected layer outputting the classification scores
        #Input dimension: 2048
        #Output dimension: 2
        x = self.fc3(x)
        return x
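
Before moving on to training, a quick forward pass on a dummy batch (another small addition to the original code) confirms that all the layer dimensions line up:

#Sanity check: a dummy batch of four 3 x 224 x 224 images should produce one
#score per class for each image
net = Net().to(device)
dummy = torch.randn(4, 3, 224, 224).to(device)
print(net(dummy).shape)  #torch.Size([4, 2])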

Finally, we have fully defined our network, but how will it be trained and optimised? We will go through this in more detail in part 2 of the post. See you there ;)

References

[1] NHS. 2019. Pneumonia. https://www.nhs.uk/conditions/pneumonia/.
[2] Paul Mooney. 2018. Chest X-Ray Images (Pneumonia). https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia.
[3] Andrew Ng. 2017. C4W2L02 Classic Network. https://www.youtube.com/watch?v=dZVkygnKh1M.
[4] Weidong Xu, Zeyu Zhao, and Tianning Zhao. Tutorial: Dropout as Regularization and Bayesian Approximation. https://xuwd11.github.io/Dropout_Tutorial_in_PyTorch/.
