Implementing a computerized facial expression analysis system for automatic coding requires selecting a threshold for the system’s classifier outputs. However, there are many potential ways to select a threshold. How do different criteria and metrics compare? Manually FACS-coded video from 45 clinical interviews (Spectrum dataset) was processed using person-specific active appearance models (AAM). Support vector machine (SVM) classifiers were trained using an independent dataset (RU-FACS). Spectrum sessions were randomly assigned to training (n=32) and testing (n=13) sets. Six different threshold selection criteria were compared for automatic AU coding. Three major findings emerged: 1) Thresholds that attempt to balance the confusion matrix (using kappa, F1, or MCC) performed significantly better on all metrics than thresholds that select arbitrary error or accuracy rates (such as TPR, FPR, or EER). 2) AU detection scores for kappa, F1, and MCC were highly intercorrelated; accuracy was uncorrelated with the others. 3) Kappa, MCC, and F1 were all positively correlated with base rate: they increased as AU base rates increased. Accuracy, by contrast, showed the opposite pattern and was strongly negatively correlated with base rate. These findings suggest that better automatic coding can be obtained by using threshold-selection criteria that balance the confusion matrix and benefit from increased AU base rates in the training data.
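To make the idea of a threshold that balances the confusion matrix concrete, the following is a minimal sketch, not the procedure used in the study. It assumes continuous classifier scores (for example, SVM decision values) and binary manual codes for a single AU, and it picks the candidate threshold that maximizes kappa, F1, or MCC on the training data. The function names and the toy data are hypothetical and exist only for illustration.

```python
import math

def confusion(scores, labels, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are coded as AU present."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = s >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1(tp, fp, fn, tn):
    """F1 score; returns 0.0 when it is undefined (no positives at all)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def select_threshold(scores, labels, metric):
    """Return the observed score that maximizes `metric` when used as threshold.

    Ties are broken in favor of the lowest candidate threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: metric(*confusion(scores, labels, t)))

# Hypothetical toy data: classifier scores and manual codes (1 = AU present).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
labels = [0, 0, 0, 0, 1, 0, 1, 1]

for name, metric in [("kappa", kappa), ("F1", f1), ("MCC", mcc)]:
    t = select_threshold(scores, labels, metric)
    print(name, "threshold:", t, "score:", round(metric(*confusion(scores, labels, t)), 3))
```

In practice the threshold would be chosen on the training sessions and then applied unchanged to the held-out test sessions, so that the reported agreement is not inflated by tuning on the test data.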
When I wrote this paper back in 2011, I was just learning about performance evaluation. This was a first, and rather naive, attempt at understanding the connection between agreement, prevalence, and threshold selection. Readers interested in more sophisticated approaches to these issues are encouraged to look up Guangchao Charles Feng, who has done nice work in this area.