Three useful measures to evaluate your machine learning system

In an earlier article, we explained what you should consider when you prepare a training set for your machine learning system. Once you have a set ready, you will want to train your system on it, and an important part of training is evaluation. To help you with this, we will present three useful measures for evaluating the performance of your system.

How can I evaluate a black box?

Many people think of machine learning systems as black boxes. You give some form of input, you receive some form of output, and what happens in between, no one knows. While it is true that some machine learning systems do not give us direct reports about how they transform input into output, this does not necessarily mean that we cannot observe and control what they are doing. With the right evaluation techniques, you can track the training of your machine learning system and take steps to improve it.

At turicode, we use three main measures to evaluate the training of our engine MINT.extract, which extracts valuable information from documents. The combination of these gives us comprehensive insight into how our system learns and allows us to adjust its training effectively. You do not have to be a data scientist to understand our combined evaluation tool: the measures we apply can all be represented in a way that lets you read them at a glance.

1. The F1 score: the hard facts

Let us start with the most important and most widely known measure for reporting evaluation results of machine learning systems that classify data: the F1 score. It indicates how accurate the data labels predicted by a learning system are. The higher the F1 score, the more reliably the system works.

The F1 score is a value between 0 and 1. It is based on precision and recall, which are also expressed as values between 0 and 1. Precision states how many of the data labels predicted by the machine learning system are set correctly, while recall measures how many of the relevant data points in a sample the system actually found.

But how do you calculate precision and recall? For this, you have to split your labelled data samples into two sets — a larger training set and a smaller evaluation set. After that, you let your system train on the labels in the training set and predict the labels for the evaluation set. If you compare the labels predicted by your system to the labels you set yourself, you can express the difference with values for precision and recall.
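Comparing predicted labels against your own labels can be sketched in a few lines of Python. This is a minimal illustration, not turicode's actual tooling; the function name and the representation of labels as `(data point, label)` pairs are our own assumptions.

```python
def precision_recall(predicted, gold):
    """Compare a system's predicted labels against human-set gold labels.

    Both arguments are sets of (data_point_id, label) pairs.
    """
    true_positives = len(predicted & gold)  # labels the system got right
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy evaluation set: the system finds two of the three gold labels
# and adds one spurious prediction of its own.
gold = {(1, "price"), (2, "description"), (3, "price")}
predicted = {(1, "price"), (2, "description"), (4, "price")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # both 2/3: two of three predictions correct,
                          # two of three gold labels found
```

In a real evaluation, `gold` holds the labels you set on the evaluation set yourself and `predicted` holds what the trained system produced for the same documents.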

Let us look at an example: if we have a sample document with 100 relevant data points and run our machine learning system on it, it might predict 96 data points. Let us say that 95 of these are labelled correctly, and one is an irrelevant data point. This means that the precision equals 0.99 (95 out of 96 labels are correct) and the recall equals 0.95 (95 out of 100 relevant data points were found). The F1 score is the harmonic mean of 0.99 and 0.95, which is 0.97.
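The arithmetic of this example can be checked directly; the harmonic mean weights precision and recall so that a low value in either one pulls the score down.

```python
precision = 95 / 96   # 95 of 96 predicted labels are correct
recall = 95 / 100     # 95 of 100 relevant data points were found

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.99 0.95 0.97
```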

At this point, we have to mention that the F1 score rarely reaches the value 1. Similar to a human mind, a well-trained machine learning system is right in most cases, but it also makes mistakes from time to time. You should try to train your system at least to the point where it works as reliably as a human annotator.

The F1 score tells you how well your system is doing in general, which is a great starting point. But if you want more detailed insights, you need other measures.

2. The learning curve: progress with added data

The learning curve is a visualization of the progress of your machine learning system in relation to the training data you feed it. It states how the F1 score — or any other score you work with — rises with additional training data. The vertical axis of the curve indicates the score, and the horizontal axis the number of samples your system trained on.

The learning curve comes in handy when you need to decide whether you need more training data. If the curve is still pointing upwards, as in the left part of the picture above, you can improve your scores by adding more data. If the curve flattens out, as on the right, you can either stop training and go into production, or, if the score is still too low, dig deeper to find out why your system no longer improves.
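A learning curve is produced by training on growing slices of the training set and scoring each resulting model on the same held-out evaluation set. The sketch below does this for an invented one-feature toy task with a trivially "trainable" threshold classifier; everything here (the synthetic data, the threshold model, the size steps) is an illustrative assumption, not part of any real pipeline.

```python
import random

random.seed(0)

def make_point(i):
    # Synthetic one-feature task: class 0 centred at 0.0, class 1 at 2.0
    label = i % 2
    return random.gauss(0.0 if label == 0 else 2.0, 1.0), label

def train_threshold(points):
    # "Training" here just places a threshold halfway between the class means
    xs0 = [x for x, y in points if y == 0]
    xs1 = [x for x, y in points if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def f1_score(points, threshold):
    tp = sum(1 for x, y in points if x > threshold and y == 1)
    fp = sum(1 for x, y in points if x > threshold and y == 0)
    fn = sum(1 for x, y in points if x <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

train_set = [make_point(i) for i in range(1000)]
eval_set = [make_point(i) for i in range(500)]

# One point on the learning curve per training-set size:
# same evaluation set, growing amount of training data.
for size in (10, 50, 200, 1000):
    threshold = train_threshold(train_set[:size])
    print(size, round(f1_score(eval_set, threshold), 3))
```

Plotting the printed pairs (training size on the horizontal axis, F1 on the vertical) gives exactly the kind of curve described above: it climbs while data is scarce and flattens once more samples stop changing the learned threshold much.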

3. The confusion matrix: going into the details

Despite its name, the confusion matrix gives you the clearest insight into the training of your machine learning system. The confusion matrix is a table that compares the labels you set yourself with the labels set by the machine.

In the confusion matrix in the picture, the data points predicted by the machine are listed on the vertical axis and the human-labelled ones on the horizontal axis. The diagonal from top left to bottom right contains all the data points that the system predicted correctly; all missed or false predictions lie off this diagonal. In this particular case, the system missed 45 descriptions that it left unlabelled (non-data), but only once labelled something else as a description.

The confusion matrix allows you to evaluate where exactly your system failed in the training. If your system is working well on most data points, but experiencing trouble with a particular category, you can focus your training on this particular category.
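The counting behind such a matrix is simple: tally how often each gold label co-occurs with each predicted label. Here is a minimal sketch; the label names and the five toy data points are invented for illustration.

```python
from collections import Counter

# Hypothetical gold vs. predicted labels for five data points
gold      = ["description", "price", "description", "non-data", "price"]
predicted = ["description", "price", "non-data",    "non-data", "price"]

# Count (gold, predicted) pairs: cells where gold == predicted
# form the diagonal of the confusion matrix.
matrix = Counter(zip(gold, predicted))

for (g, p), count in sorted(matrix.items()):
    marker = "" if g == p else "  <- off-diagonal: an error to investigate"
    print(f"gold={g:<12} predicted={p:<12} {count}{marker}")
```

Reading the off-diagonal cells tells you which category to focus further training on, exactly as described above.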

Have a look into the black box

If you apply the F1 score, the learning curve and the confusion matrix to evaluate the training of your machine learning system, they will provide you with a lot of useful information on how your system is doing. With this information at hand, you do not need to be a data scientist to find the right course of action to improve the performance of your system.


WRITTEN BY turicode Inc. | @turicode
