Determining how well our solution is performing over time.
Before we start building our solution, we need to make sure we have methods to evaluate it. We'll use our objective here to determine the evaluation criteria.

- Be clear about what metrics you are prioritizing.
- Be careful not to over-optimize on any one metric.
Evaluation doesn't just involve measuring how well we're doing; we also need to think about what happens when our solution is incorrect.

- What are the fallbacks?
- What feedback are we collecting?
For our task, we want to suggest highly relevant tags (precision) so we don't fatigue the user with noise. But the whole point of this task is to suggest tags that the author will miss (recall) so our users can find the best resources. So we'll need to trade off between precision and recall.
| Variable | Description |
| --- | --- |
| \(TP\) | # of samples correctly predicted as positive (and truly positive) |
| \(TN\) | # of samples correctly predicted as negative (and truly negative) |
| \(FP\) | # of samples incorrectly predicted as positive (but truly negative) |
| \(FN\) | # of samples incorrectly predicted as negative (but truly positive) |
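To make the table concrete, here is a minimal sketch (with hypothetical binary labels) showing how these counts map to precision and recall:

```python
# Hypothetical ground-truth labels and model predictions for illustration.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

# Count each outcome by comparing predictions to the true labels.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 2

precision = tp / (tp + fp)  # of the tags we suggested, how many were right?
recall = tp / (tp + fn)     # of the relevant tags, how many did we suggest?
```

For multi-label tag prediction, these counts would be computed per tag and then averaged (micro or macro), but the underlying definitions are the same.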
Normally, the go-to option would be the F1 score (which weights precision and recall equally), but we shouldn't be afraid to craft our own evaluation metrics that best represent our needs. For example, we may want to account for both precision and recall but give more weight to recall.
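One standard way to do this is the F-beta score, where beta > 1 weights recall more heavily than precision (the numbers below are hypothetical, for illustration only):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall; beta < 1 favors precision; beta = 1 is the F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 0.9, 0.5
f1 = f_beta(precision, recall, beta=1.0)  # ≈ 0.64
f2 = f_beta(precision, recall, beta=2.0)  # ≈ 0.55
```

With beta = 2, the score is pulled closer to the (lower) recall value, so a model can no longer hide a poor recall behind a high precision.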
Fortunately, when we make a mistake, it's not catastrophic. The author will simply ignore our suggestion, and we'll capture the error based on the tags that the author does add. We'll use this feedback (in addition to an annotation workflow) to improve our solution over time.
If we want to be very deliberate, we can give authors an option to report erroneous tags. Not everyone will act on this, but it could reveal underlying issues we may not be aware of.