Lung Cancer Detection
There are about 225,000 new cases of lung cancer every year in the United States, with a cost of $12 Billion. Although doctors try to do everything they can to accurately diagnose patients, even they are prone to mistakes and human error. By combining the dataset provided by Kaggle and machine learning algorithms, we will try to accurately predict whether a person will develop lung cancer within the next year based solely on their CT Scans.
Problem Statement
The current Lung Cancer detection accuracy among doctors is roughly 74% (source: www.ncbi.nlm.nih.gov). The graph below displays the confusion matrix of doctors’ predictions. Although they are able to catch the majority of patients that develop cancer, there is a high number of False Positives. This is problematic because of high medical costs, high anxiety of chemotherapy and cancer drugs, and the deadly risks that come with surgery.
Goal
Our goal is to use data science to create a better model that will increase overall accuracy. Ideally, we would like to decrease False Positives without increasing False negatives due to the risks of not detecting cancer in the early stages.
Methods
We will pre-process the CT scans into 3d arrays of the lungs, then feed the arrays into a Convolutional Neural Network (CNN). CNNs are great for finding patterns in spatial composition. They work well with 2D images, 3D images, sentence structure, etc. The Kaggle competition will be measuring in log loss. This means that we will find the probability that a patient will develop cancer. Log loss works so that an algorithm is severely penalized if the probability is way off.
Risks and Assumptions
One of the main risks in dealing with this data is that we are assuming that there is no information on the patients themselves. The dataset does not give us gender, age, or how frequently they smoke. These factors contribute heavily towards whether or not the person will develop lung cancer. The competition has left these variable out on purpose in order to ensure that the participants use machine vision to approach this problem. Also, the dataset only includes 1400 patients. This is very small since we are going to have roughly 2 million pixels/inputs.
Procedure
Its a well known fact that cancer usually starts at the lymph nodes. We have to first create a mask of the lymph nodes and isolate them for each scan. Then we will stack them on top of each other to get a 3D array of a single person’s lymph nodes. Below is a diagram of what I did.
Segmenting the lymph nodes took quite awhile for 1400 patients (~16 hours). I scaled each patients’ array down to 128x128x128 image due to the constraints of my local computer and time.
The Convolutional Neural Network
After obtaining the 3d arrays of each patient, I put the entire dataset into a convolutional neural network (CNN). Here’s a good example of what the CNN looks like:
I used sklearn’s classifier neural network called sknn.mlp. I used 2 convolutional layers, a rectifier layer, and softmax layer. I used a very small learning rate of 0.00002. This allow the computer to look at the image with more detail, at the risk of overfitting.
The neural network trained for 7 epochs and reached a minimum training error of 0.1. Using the trained network, I tested it on roughly 300 patients that the network had never seen before. From there, I was able to obtain the results of the test set.
Results
The neural network had an accuracy of 69%. The following show the confusion matrix and a histogram of the predicted probabilities.
These results show that we were able to correctly predict that most people did not have cancer. It managed to catch 17 people that had cancer. We were able to decrease the false positive rate of doctors from 0.94 down to 0.66. However, the false negative rate increased from 0.74 up to 0.91.
This histogram shows that the neural network predicted that most people did not have cancer. This makes sense in the general population. If we look at the patients that the network predicted had the highest probability to have cancer, it got it right about half the time. Typically we’d want to see better odds than 50/50.
Conclusion
Although the convolutional neural network had a decent 69% accuracy, it did not beat the baseline accuracy of 74%. Also, there are some ethical issues to take into account in predicting cancer. Rather than simply trying to increase accuracy, it is more important to catch people that have cancer rather then overdiagnosing patients. This means that we would rather have a low false negative rate than a low false positive rate since the value of human life is so great. Therefore, in a practical sense, we cannot use this algorithm to help us predict the onset of lung cancer. As developments in neural network and computers arise, I am confident that we can beat the baseline accuracy and eventually predict lung cancer with greater than 95% accuracy.
Please check out the original code in my jupyter notebook here