Before we start building our solution, we need to make sure we have methods to evaluate it. We’ll use our objective here to determine the evaluation criteria.
Evaluation isn’t just about measuring how well we’re doing; we also need to think about what happens when our solution is incorrect.
Watch from 0:00 for a video walkthrough of this section.
For our task, we want to suggest highly relevant tags (precision) so we don’t fatigue the user with noise. But recall that the whole point of this task is to suggest tags that the author will miss (recall) so our users can find the best resources! So we’ll need to trade off between precision and recall.
Normally, the go-to option would be the F1 score (the harmonic mean of precision and recall), but we shouldn’t be afraid to craft our own evaluation metrics that best represent our needs. For example, we may want to account for both precision and recall but give more weight to recall.
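One way to weight recall more heavily is a generalized F-beta score, where beta > 1 favors recall. Here's a minimal sketch for a single prediction, treating the author's tags and our suggested tags as sets (the function name and tag values are illustrative, not from a specific library):

```python
def fbeta(true_tags, pred_tags, beta=2.0):
    """F-beta score over two tag sets; beta > 1 weights recall over precision."""
    true_set, pred_set = set(true_tags), set(pred_tags)
    if not true_set or not pred_set:
        return 0.0
    tp = len(true_set & pred_set)          # tags we suggested that were correct
    precision = tp / len(pred_set)         # how clean our suggestions are
    recall = tp / len(true_set)            # how many relevant tags we covered
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

With beta=2, missing relevant tags (low recall) hurts the score more than suggesting a few extra ones (low precision), which matches our goal of surfacing tags the author would otherwise miss. Libraries such as scikit-learn offer an equivalent `fbeta_score` for label-encoded data if you'd rather not roll your own.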
Fortunately, when we make a mistake, it’s not catastrophic. The author will simply ignore the bad suggestion, and we can capture the error by comparing our suggestions against the tags the author does add. We’ll use this feedback (in addition to an annotation workflow) to improve our solution over time.
If we want to be very deliberate, we can give authors an option to report erroneous tags. Not everyone will act on this, but it could reveal underlying issues we aren’t aware of.
Watch from 1:38 for a video walkthrough of this section.