Figure 6 - available via license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Performance of Round 1 models learned from 10 different sets of feedback examples on the full test set: F1, F1 on the ground-truth answerable subset (Ans. F1), F1 on the ground-truth unanswerable subset (Unans. F1), classification accuracy (CLS Acc.), and percentage of predicted unanswerable outputs (%Unans.). Bars represent variance and are centered at the mean value.
Source publication
We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the und...
Context in source publication
Context 1
... simulate deploying and improving the initial model 10 times: we randomly sample 10 sets of 200 examples from the pool of 800 examples used in Section 6.3 and improve the initial model on each set separately. Figure 6 shows the performance of models learned from those 10 different sets. The overall F1 score, F1 on the answerable subset, and classification accuracy show relatively small variance (standard deviation σ = 3.32, 2.89, and 2.44) on the full test set from Section 6.1. ...
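The sampling-and-summarizing procedure described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the pool contents, the `train_on_feedback` step, and the per-run metric scores are all placeholders; only the sampling scheme (10 random sets of 200 drawn from a pool of 800) and the mean/standard-deviation summary mirror the text.

```python
import random
import statistics

def sample_feedback_sets(pool, n_sets=10, set_size=200, seed=0):
    """Draw n_sets random subsets (without replacement within each set)
    of feedback examples from the shared pool, as in the text."""
    rng = random.Random(seed)
    return [rng.sample(pool, set_size) for _ in range(n_sets)]

def summarize(scores):
    """Mean and sample standard deviation over per-run metric scores,
    matching the mean-centered bars and σ values reported for Figure 6."""
    return statistics.mean(scores), statistics.stdev(scores)

# Stand-in pool: indices of the 800 feedback examples from Section 6.3.
pool = list(range(800))
feedback_sets = sample_feedback_sets(pool)
assert len(feedback_sets) == 10
assert all(len(s) == 200 for s in feedback_sets)

# Hypothetical per-run F1 scores for one metric; in the paper each run
# would come from retraining the initial model on one feedback set.
mean_f1, std_f1 = summarize([60.0, 62.0, 64.0])
```

Each sampled set would then be used to improve the initial model independently, and the resulting 10 metric values would be summarized by their mean and σ as in the figure.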