In the previous article in this Machine Learning series, we looked at k-NN Classification with Apache® Ignite™. We’ll now turn to another Machine Learning algorithm and conclude our series: K-Means Clustering, applied to the Titanic dataset. Conveniently, Kaggle provides the dataset in CSV form. For our analysis, we are interested in two clusters: passengers who survived and passengers who did not.
Some cleanup and formatting is required to get the data into a suitable format for Apache Ignite. The CSV data contains a number of columns, as follows:
- Passenger id
- Survived (0 = no, 1 = yes)
- Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Passenger name
- Gender
- Age in years
- Number of siblings / spouses aboard the Titanic
- Number of parents / children aboard the Titanic
- Ticket number
- Passenger fare
- Cabin number
- Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Our first task is to remove any columns that are unique to a particular passenger and, therefore, do not correlate to survival. So, we can remove the following:
- Passenger id
- Passenger name
- Ticket number
- Cabin number
Next, we’ll remove any rows where data are missing, such as Age or Port of embarkation. We could impute these values, but we will remove missing values for our initial analysis.
Our final step will be to convert several fields to a numeric format. For example, Gender will be converted as follows:
- 0 = female
- 1 = male
and Port of embarkation as follows:
- 0 = Q (Queenstown)
- 1 = C (Cherbourg)
- 2 = S (Southampton)
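The cleanup rules above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name TitanicEncoding and its methods are hypothetical, the raw values are assumed to be "male"/"female" and "C"/"Q"/"S", and the fields are assumed to have been split out of the CSV already. A row with any missing field is dropped rather than imputed.

```java
import java.util.Optional;

// Hypothetical helper, not part of the article's code base.
final class TitanicEncoding {
    /** Gender mapping from the article: 0 = female, 1 = male. */
    static double encodeGender(String raw) {
        switch (raw.trim().toLowerCase()) {
            case "female": return 0.0;
            case "male":   return 1.0;
            default: throw new IllegalArgumentException("Unknown gender: " + raw);
        }
    }

    /** Port mapping from the article: 0 = Q, 1 = C, 2 = S. */
    static double encodePort(String raw) {
        switch (raw.trim().toUpperCase()) {
            case "Q": return 0.0;
            case "C": return 1.0;
            case "S": return 2.0;
            default: throw new IllegalArgumentException("Unknown port: " + raw);
        }
    }

    /**
     * Builds one cleaned row in the final column order (Survived last),
     * or returns an empty Optional if any field is missing.
     */
    static Optional<double[]> toRow(String ticketClass, String gender, String age,
                                    String sibSp, String parch, String fare,
                                    String port, String survived) {
        String[] raw = {ticketClass, gender, age, sibSp, parch, fare, port, survived};
        for (String field : raw) {
            if (field == null || field.trim().isEmpty())
                return Optional.empty(); // Drop rows with missing data.
        }
        return Optional.of(new double[] {
            Double.parseDouble(ticketClass.trim()),
            encodeGender(gender),
            Double.parseDouble(age.trim()),
            Double.parseDouble(sibSp.trim()),
            Double.parseDouble(parch.trim()),
            Double.parseDouble(fare.trim()),
            encodePort(port),
            Double.parseDouble(survived.trim())
        });
    }
}
```

For example, TitanicEncoding.toRow("3", "male", "22", "1", "0", "7.25", "S", "0") yields an eight-element row, while a call with an empty Age field yields an empty Optional.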
The final dataset consists of the following columns:
- Ticket class
- Gender
- Age in years
- Number of siblings / spouses aboard the Titanic
- Number of parents / children aboard the Titanic
- Passenger fare
- Port of embarkation
- Survived
... and 712 rows of data. The Survived column has been moved to the end, so the label is always the last field in each row.
We’ll now split the data into training data (80%) and test data (20%). As we have done in the previous articles in this series, we’ll use Scikit-learn to do this data splitting for us.
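The article does the split with Scikit-learn. For readers who want to stay in Java, a shuffled 80/20 split could be sketched as follows. The class TrainTestSplit and the seed are hypothetical, and rounding the test size up is an assumption, chosen because it turns 712 rows into 569 training rows and 143 test rows, which matches the totals of the confusion matrices reported later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; a sketch, not a reimplementation of Scikit-learn.
final class TrainTestSplit {
    /** Shuffles a copy of the rows and returns a two-element list: [train, test]. */
    static <T> List<List<T>> split(List<T> rows, double testFraction, long seed) {
        if (testFraction <= 0.0 || testFraction >= 1.0)
            throw new IllegalArgumentException("testFraction must be in (0, 1)");

        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // Fixed seed for repeatability.

        // Assumption: the test size is rounded up, the train set takes the rest.
        int testSize = (int) Math.ceil(shuffled.size() * testFraction);
        int trainSize = shuffled.size() - testSize;

        List<T> train = new ArrayList<>(shuffled.subList(0, trainSize));
        List<T> test = new ArrayList<>(shuffled.subList(trainSize, shuffled.size()));
        return Arrays.asList(train, test);
    }
}
```

Each part can then be written to its own CSV file for loading into Ignite.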
With our training and test data ready, we can start coding the application. You can download the code from GitHub if you would like to follow along. The application performs the following steps:
- Read the training data and test data
- Store the training data and test data in Ignite
- Use the training data to fit the K-Means Clustering model
- Apply the model to the test data
- Determine the confusion matrix and the accuracy of the model
Read the training data and test data
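We can use the following code to read in values from the CSV files:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import org.apache.ignite.IgniteCache;

private static void loadData(String fileName, IgniteCache<Integer, TitanicObservation> cache)
    throws FileNotFoundException {
    // try-with-resources closes the file once all rows have been read.
    try (Scanner scanner = new Scanner(new File(fileName))) {
        int cnt = 0;
        while (scanner.hasNextLine()) {
            String[] cells = scanner.nextLine().split(",");
            // Every column except the last is a feature; the last column is the Survived label.
            double[] features = new double[cells.length - 1];
            for (int i = 0; i < cells.length - 1; i++)
                features[i] = Double.parseDouble(cells[i]);
            double survivedClass = Double.parseDouble(cells[cells.length - 1]);
            // The row number serves as the cache key.
            cache.put(cnt++, new TitanicObservation(features, survivedClass));
        }
    }
}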
The code reads the data line-by-line and splits fields on a line by the CSV field separator. Each field value is then converted to double format and then the data are stored in Ignite.
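The loading code relies on a TitanicObservation class that is not listed in this article. A minimal sketch of what such a value class might look like is given below. The constructor and the two getters match the calls used in this article, but the field layout, the defensive copies, and the Serializable marker are assumptions, not the author's actual implementation.

```java
import java.io.Serializable;

// Hypothetical sketch of the value type stored in the Ignite caches.
final class TitanicObservation implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Feature values in the column order of the cleaned dataset. */
    private final double[] features;

    /** Label: 0 = did not survive, 1 = survived. */
    private final double survivedClass;

    public TitanicObservation(double[] features, double survivedClass) {
        this.features = features.clone(); // Copy so later changes by the caller have no effect.
        this.survivedClass = survivedClass;
    }

    public double[] getFeatures() {
        return features.clone();
    }

    public double getSurvivedClass() {
        return survivedClass;
    }
}
```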
Store the training data and test data in Ignite
The previous code stores data values in Ignite. To use this code, we need to create the Ignite storage first, as follows:
IgniteCache<Integer, TitanicObservation> trainData = getCache(ignite, "TITANIC_TRAIN");
IgniteCache<Integer, TitanicObservation> testData = getCache(ignite, "TITANIC_TEST");
loadData("src/main/resources/titanic-train.csv", trainData);
loadData("src/main/resources/titanic-test.csv", testData);
The code for getCache() is implemented as follows:
private static IgniteCache<Integer, TitanicObservation> getCache(Ignite ignite, String cacheName) {
    CacheConfiguration<Integer, TitanicObservation> cacheConfiguration = new CacheConfiguration<>();
    cacheConfiguration.setName(cacheName);
    // Spread the cache over 10 partitions; neighbor exclusion is disabled (false).
    cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 10));
    return ignite.createCache(cacheConfiguration);
}
Use the training data to fit the K-Means Clustering model
Now that our data are stored, we can create the trainer as follows:
KMeansTrainer trainer = new KMeansTrainer()
    .withK(2)
    .withDistance(new EuclideanDistance())
    .withSeed(123L);
We set the value of k to 2 to represent the two clusters (survived and not survived). For the distance measure, we have several options, such as Euclidean, Hamming, or Manhattan; we’ll use Euclidean in this case. We also set the seed to 123 so that the results are reproducible.
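To make the difference between the three distance measures concrete, here are their textbook definitions in plain Java. The class Distances is hypothetical and only illustrates the formulas; it is not the Ignite implementation of these measures.

```java
// Hypothetical helper showing the textbook definitions of three distance measures.
final class Distances {
    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("Vectors must have the same length");
    }

    /** Square root of the sum of squared coordinate differences. */
    static double euclidean(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Sum of absolute coordinate differences. */
    static double manhattan(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Number of coordinates in which the two vectors differ. */
    static double hamming(double[] a, double[] b) {
        checkSameLength(a, b);
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i])
                differing++;
        }
        return differing;
    }
}
```

For the vectors (0, 0) and (3, 4), the Euclidean distance is 5, the Manhattan distance is 7, and the Hamming distance is 2.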
We can now fit the K-Means Clustering model to the training data, as follows:
KMeansModel mdl = trainer.fit(
    ignite,
    trainData,
    (k, v) -> v.getFeatures(),     // Feature extractor.
    (k, v) -> v.getSurvivedClass() // Label extractor.
);
Ignite stores data in a Key-Value (K-V) format, so both extractors receive the key and the value of each cache entry and read only the value. The features are taken from the first seven columns and the target value is the Survived class.
Apply the model to the test data
Next, we are ready to check the test data against the trained model. We can do this as follows:
int amountOfErrors = 0;
int totalAmount = 0;
// Rows are indexed by the prediction, columns by the ground truth.
int[][] confusionMtx = {{0, 0}, {0, 0}};
// Scan every entry in the test cache; the cursor is closed automatically.
try (QueryCursor<Cache.Entry<Integer, TitanicObservation>> cursor = testData.query(new ScanQuery<>())) {
    for (Cache.Entry<Integer, TitanicObservation> testEntry : cursor) {
        TitanicObservation observation = testEntry.getValue();
        double groundTruth = observation.getSurvivedClass();
        double prediction = mdl.apply(new DenseLocalOnHeapVector(observation.getFeatures()));
        totalAmount++;
        if ((int) groundTruth != (int) prediction)
            amountOfErrors++;
        int idx1 = (int) prediction;
        int idx2 = (int) groundTruth;
        confusionMtx[idx1][idx2]++;
        System.out.printf(">>> | %.4f\t | %.0f\t\t\t|\n", prediction, groundTruth);
    }
}
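Conceptually, applying a fitted K-Means model to a vector returns the index of the nearest cluster center. The sketch below shows that idea in plain Java with Euclidean distance. It is not Ignite's implementation, and the class NearestCentroid and the centers array are hypothetical. In general, K-Means cluster indices carry no inherent meaning, so reading cluster 0 and cluster 1 as the Survived values 0 and 1, as the evaluation loop does, is an assumption worth checking against the results.

```java
// Hypothetical sketch of nearest-centroid assignment.
final class NearestCentroid {
    /** Returns the index of the center closest to the point (Euclidean distance). */
    static int predict(double[][] centers, double[] point) {
        if (centers.length == 0)
            throw new IllegalArgumentException("At least one center is required");

        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            // The squared distance is enough for finding the minimum.
            double sum = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = centers[c][i] - point[i];
                sum += d * d;
            }
            if (sum < bestDist) {
                bestDist = sum;
                best = c;
            }
        }
        return best;
    }
}
```

With centers (0, 0) and (10, 10), the point (1, 2) is assigned to cluster 0 and the point (9, 8) to cluster 1.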
Determine the confusion matrix and the accuracy of the model
Now we can compare how the model classifies against the actual survived values (Ground Truth) using our test data.
Running the code gives us the following summary:
>>> Absolute amount of errors 56
>>> Accuracy 0.6084
>>> Precision 0.5865
>>> Recall 0.9873
>>> Confusion matrix is [[78, 55], [1, 9]]
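The summary figures can be recomputed from the confusion matrix alone. The sketch below assumes that rows are indexed by the prediction and columns by the ground truth, and that class 0 is treated as the positive class. Both assumptions are inferred from the reported numbers, which they reproduce, rather than taken from the author's code. The class ConfusionMetrics is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper; m[prediction][groundTruth], class 0 taken as positive.
final class ConfusionMetrics {
    /** Off-diagonal cells are the misclassified observations. */
    static int errors(int[][] m) {
        return m[0][1] + m[1][0];
    }

    /** Share of observations on the diagonal. */
    static double accuracy(int[][] m) {
        int total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return (m[0][0] + m[1][1]) / (double) total;
    }

    /** Of everything predicted as class 0, the share that really is class 0. */
    static double precision(int[][] m) {
        return m[0][0] / (double) (m[0][0] + m[0][1]);
    }

    /** Of everything that really is class 0, the share predicted as class 0. */
    static double recall(int[][] m) {
        return m[0][0] / (double) (m[0][0] + m[1][0]);
    }

    public static void main(String[] args) {
        int[][] m = {{78, 55}, {1, 9}};
        System.out.printf(Locale.ROOT, ">>> Absolute amount of errors %d%n", errors(m));
        System.out.printf(Locale.ROOT, ">>> Accuracy %.4f%n", accuracy(m));
        System.out.printf(Locale.ROOT, ">>> Precision %.4f%n", precision(m));
        System.out.printf(Locale.ROOT, ">>> Recall %.4f%n", recall(m));
    }
}
```

Read this way, the high recall and low precision show that the unscaled model places almost every passenger in cluster 0: 133 of the 143 test rows.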
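Can we improve upon these initial results? One thing we can try is to scale the features, because Euclidean distance is otherwise dominated by features with large ranges, such as Passenger fare and Age. Both Scikit-learn and Ignite provide a min-max scaler, and applying it gives us the following summary: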
>>> Absolute amount of errors 29
>>> Accuracy 0.7972
>>> Precision 0.8205
>>> Recall 0.8101
>>> Confusion matrix is [[64, 14], [15, 50]]
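Min-max scaling maps each feature to the range [0, 1] using that feature's minimum and maximum. The following plain Java sketch shows the arithmetic. It is not the Scikit-learn or Ignite implementation: the class MinMaxScaling and its methods are hypothetical, and mapping a constant column to 0 is a choice made for this sketch. The bounds are learned from one set of rows (typically the training data) and then applied to any row.

```java
// Hypothetical sketch of min-max scaling.
final class MinMaxScaling {
    /** Learns per-column bounds; returns {mins, maxs}. */
    static double[][] fit(double[][] rows) {
        if (rows.length == 0)
            throw new IllegalArgumentException("At least one row is required");

        int cols = rows[0].length;
        double[] min = rows[0].clone();
        double[] max = rows[0].clone();
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        return new double[][] {min, max};
    }

    /** Applies (x - min) / (max - min) to each column of one row. */
    static double[] transform(double[][] bounds, double[] row) {
        double[] min = bounds[0];
        double[] max = bounds[1];
        double[] scaled = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            double range = max[j] - min[j];
            // Assumption of this sketch: a constant column is mapped to 0.
            scaled[j] = range == 0.0 ? 0.0 : (row[j] - min[j]) / range;
        }
        return scaled;
    }
}
```

For example, after fitting on the rows (1, 10), (3, 20), and (5, 30), the row (3, 20) is transformed to (0.5, 0.5).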
As part of further analysis, we should also investigate the relationship between Survived and features such as Age and Gender.
Summary
K-Means Clustering is an unsupervised algorithm, so in the general case it doesn't suit supervised learning tasks. However, such an approach can be effective if the classes are well separated. In our analysis of two clusters (passengers who survived and passengers who did not), scaling the features raised the accuracy from 0.6084 to 0.7972.
This concludes this series on Machine Learning with Apache Ignite. The reader is encouraged to try out the various examples provided with the Apache Ignite Machine Learning library.