Use plenty of polite, deferential phrasing throughout: "We sincerely hope," "we deeply appreciate," "we respectfully disagree." The biggest taboo is bluntly telling a reviewer to their face that they are wrong (even if they really are clueless!). For example, instead of "R1 is totally wrong," write "We understand R1's concern, but ~"; instead of "Does 2AC even understand the paper?", write "Our presentation was not clear and caused 2AC's confusion. We will revise the paper to make it crystal clear that ~". Learn to say nice things!
When responding to suggestions, or to legitimate mistakes a reviewer has pointed out, always start by acknowledging your own shortcomings, and then thank them: "We greatly appreciate R1 pointing this out!" If you want to lay the flattery on even thicker, you can write "we are grateful that R1 pointed out xxx, which certainly makes the statement scientifically stronger."
And at the end (or the beginning), express gratitude to all the reviewers! Be especially thankful to your close allies among them, R1 and R3, and quote their own words back to them. For example, if R1 praised your work as a "seminal work" (i.e., very promising), you can lift the praise verbatim: "we are encouraged by R1's comment 'this is a seminal work in the xxx field'." After all, nothing works better than using someone else's praise to praise yourself!
References cited in this article:
[1] Mingrui Zhang, Shumin Zhai, and Jacob O. Wobbrock. Text Entry Throughput: Towards Unifying Speed and Accuracy in a Single Performance Metric. CHI 2019.
Reviewer 4 (AC)
Expertise
Knowledgeable
Recommendation
. . . Between possibly reject and neutral; 2.5
1AC: The Meta-Review
From the reviews it seems that this paper is well written and that, by addressing the SAT, it is investigating an important area of research. However, there is disagreement about whether this paper manages to provide useful insight into how to manage the tradeoff.
Two reviewers were unsure that this approach was a suitable way to address the SAT issue [R2, R3]; by folding speed and accuracy into a single metric, there is concern that information is lost.
R3 wanted more information about whether the participant manipulation was successful: is there evidence to suggest the different cognitive sets were observed in the experiment? And can you be sure that participants fully understood the possible tradeoffs available to them?
R1 was unsure that this approach would be suitable for modelling individual differences in participants.
The reviewers provided some very comprehensive reviews with clear questions that need to be addressed by the authors.
Rebuttal Opportunity
The authors should focus on the following points
1. Does this approach (using a single metric) provide us with more information about the SAT than previous approaches? [R2]
2. How can you show that the participants truly were encouraged to use different cognitive sets? And do the authors believe that participants were able to understand the possible tradeoffs? [R3]
3. What other areas of HCI (beyond text entry) could this be applicable to? [R1]
Reviewer 3 (2AC)
Expertise
Expert
Recommendation
. . . Between reject and possibly reject; 1.5
Review
The paper has the potential of making a good contribution to the field of HCI, but not the contribution that it currently aims to make. It would be great to help HCI researchers and practitioners to better understand the speed-accuracy tradeoff (SAT). A lack of awareness of this tradeoff shows up a lot, especially in papers that end up getting rejected. It would be a great contribution to show HCI researchers how to design an experiment with a speed-accuracy-tradeoff matrix, an experimental design that motivates high speed while maintaining a 97% accuracy (as recommended, for example, by Pachella, 1974, pp.59-60, cited in [5]); maintaining such an accuracy level across the use of two competing interaction techniques, for example, helps the two techniques to be directly compared by means of their task-completion times.
This review will focus on the potential positive contribution that such a paper could make. It is important, however, to note that the stated goals of the paper, to conclude that speed and accuracy can be folded together into a single performance measure, even if this conclusion could be supported by the results, would make a negative contribution to the field. It is theoretically misguided to suggest that the two can be folded together, or to pursue this as a goal. The goal of the paper should not be to tell HCI researchers and practitioners that you can make things easier on yourself by using some new technique, but to educate HCI researchers and practitioners as to the problem of SAT, so that they can design better experiments. As stated in ([5], p.4), "the solution to this problem is to bring decision criterion under experimenter control." The goal is not to try to find some magical way to sidestep the problem.
This submission describes a valiant effort to bring the criterion under experimenter control, and that is the potential contribution. However, it appears as if the experiment could have done a better job in this effort.
Experimental Design to Motivate Five Cognitive Sets
It is not clear that the reported study truly brought the participants' decision criterion under experimenter control. In other words, the paper aims to show the result of getting participants to use different "cognitive sets", but it does not convincingly demonstrate that the experiment successfully did so; the submission needs to make clearer that participants truly used the "cognitive sets" that the paper presumes they were using. There are a number of conditions that, if it could be established that they were met, could help to establish this. The following need to be established:
1. It needs to be established that the device truly provided feedback (the green, black, or red responses, for example) that was accurately tuned to nudge participants towards one of the five cognitive sets. This feedback should be directly controlled by the payoff matrix; that is, by the cost and benefit payments made for speed versus accuracy. These matrices are very difficult to set correctly. It needs to be established that participants could truly use the feedback to arrive at one of the five cognitive sets.
2. It needs to be established that the feedback provided, and the payoff matrix, were sufficiently sensitive to deviations from the intended performance that participants could perceive there was good reason to move towards the correct strategy; that is, that the penalty for an error was something to care about. For example, nobody will care about 0.05 cents across 100 trials; people will start to care about 5 cents across 100 trials.
3. It needs to be established that participants successfully used the feedback to truly arrive at one of the five cognitive sets. One important indicator of this would be whether the feedback provided to participants consistently improved during the course of a block. This is a critical piece of data if we are to believe that the task, as designed, was truly providing feedback that nudged participants to the "cognitive sets" that the paper claims the participants were using.
Payoff Matrix
Sections 6.3 and 6.5 suggest that it is unlikely these three conditions were met. There are several reasons for this. The following issues point to potential flaws in the experimental design that make it difficult or impossible to interpret the results:
1. The instructions provided to participants for each of the five SA conditions were not consistent with the payoffs that were used for each of the five SA conditions. For example, the instruction for "EF" (in Table 2) to "just ignore errors" should result in a pure guessing strategy; that is, random keystrokes. If such performance was not observed, and was not optimally rewarded, this would be a flaw in the experimental design and outcome. Table 3 shows that this was not the reward that was applied, and that errors were not ignored by the experimenters even though this was exactly the instruction provided to participants. This makes it difficult to conclude what "cognitive set" participants truly arrived at.
2. The payoff matrix cannot be explained or understood in any kind of a direct manner. Compare and contrast the payoff matrix described in Section 6.4 with those described in, for example, Pachella and Pew (1968), Swanson and Briggs (1969), and Lyons and Briggs (1971), all of which are cited as examples of how to set a payoff matrix in [5]. The simplicity of the matrices in these three papers is as striking as the complexity of the matrix in this submission. The matrix in this submission is very complicated, and fills all of Section 6.4. Equation 13 is not directly interpretable by participants such that they could understand how to emphasize speed versus accuracy. And yet, for participants to understand the results of their performance choices, they need to understand the exact payout method. This is exactly what was done by Pachella and Pew (1968), Swanson and Briggs (1969), and Lyons and Briggs (1971). The payoff matrix must be interpretable by the participants. Some researchers have argued that the payoff matrix is almost the only thing that participants need to know to do a task with a particular cognitive set (Edwards, 1961). (A hypothetical sketch of such a directly interpretable scheme follows this list.)
3. The feedback methodology used in the experiment would have resulted in different frequencies of feedback provided to participants across different conditions. This is an important experimental confound that provides different opportunities to adjust behavior across conditions. Feedback appears to have been provided, for example, roughly 7 times per minute for smartphone plain EA, and 15 times a minute in laptop EF.
4. The eventual grouping of EA with A, and EF with F, suggests that the experimental design did not succeed in motivating participants to arrive at five unique cognitive sets.
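As a hypothetical illustration of the kind of simple, directly interpretable per-trial payoff scheme that point 2 contrasts with Equation 13: the function and every numeric value below are invented for this sketch and are not taken from the submission or from the cited papers.

    # Hypothetical per-trial payoff; all values are invented for illustration.
    def trial_payoff(correct, response_time_s,
                     reward_correct=2.0,       # cents gained for a correct response
                     penalty_error=10.0,       # cents lost for an error
                     penalty_per_second=1.0):  # cents lost per second of response time
        payoff = reward_correct if correct else -penalty_error
        payoff -= penalty_per_second * response_time_s
        return payoff

    print(trial_payoff(True, 0.8))   # 1.2   (fast and correct)
    print(trial_payoff(False, 0.4))  # -10.4 (fast but wrong)

A participant can read a scheme like this at a glance and see exactly what an error costs relative to a second of hesitation; emphasizing accuracy means raising penalty_error, and emphasizing speed means raising penalty_per_second. That immediacy is what the review argues Equation 13 lacks.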
The paper states that "all of our participants understood the bonus scheme" (p.10) but does not offer any strong evidence of this.
Results
The results section provides some evidence that the experiment successfully brought participants to a range of different SATs, and this is potentially useful to the HCI community. However, rather than using this as a basis for suggesting that experimenters should combine speed and accuracy into a single measure, there are much more important lessons to be drawn and shared:
1. The error rates of the EA and A conditions for the laptop keyboard are very low because each is simply on a different point of the horizontal line that corresponds to "extreme accuracy emphasis" in Pachella (1974, Figure 4, cited in [5]). Once performance hits this horizontal line, as is explained in Pachella, any differences in RTs are meaningless. The submission should use this data, along with a correct interpretation of SATs and how they affect human performance, to point out that any differences to be seen in the EA and A conditions for the laptop keyboard, including in Speed and AdjWPM (but not Throughput, because this measure should not be reported), are thus meaningless and uninterpretable. The differences across these two conditions just represent the difference between waiting until you are sure you are providing a correct response, and waiting just a little longer to make really sure you are providing a correct response. This will be an arbitrary difference.
2. The only useful comparisons that can be made between the laptop keyboard and the smartphone keyboards would be the ones in which the performances result in roughly 95% accuracy. This, again, can be explained based on the theory summarized nicely by Pachella (1974, cited in [5]), Figure 4. Fortunately, an N condition for each keyboard produced roughly equivalent error rates, and so these conditions could potentially be used for a comparison. It is somewhat impressive to see 93.7% accuracy for the laptop and 93.9% for the plain smartphone; these appear to result in the only WPMs that can be directly compared without concern of an SAT. This would be a great result for HCI researchers and practitioners to see.
The results suggest the potential for a contribution, but it would be more impressive if the experiment provided a clearer and more deliberate example of how to get participants to arrive at common error rates across two devices. As discussed, the experimental design in its current form does not provide a clear or exemplary path for doing so.
Information Theory
The extensive discussion of information theory throughout the paper is scientifically regressive. This kind of language and this way of thinking does not advance the field. The paper should not advance information theoretic interpretations without at least also acknowledging that this characterization of human behavior embraces a regressive view of the human, consistent with the "dark ages" of behaviorism (Meyer et al., 1988), and that this characterization has been broadly discredited in the cognitive science literature. For example:
"During those years [the 1950s] I personally became frustrated in my attempts to apply Claude Shannon’s theory of information to psychology. After some initial success I was unable to extend it beyond Shannon’s own analysis of letter sequences in written texts. The Markov processes on which Shannon’s analysis of language was based had the virtue of being compatible with the stimulus–response analysis favored by behaviorists. But information measurement is based on probabilities and increasingly the probabilities seemed more interesting that their logarithmic values, and neither the probabilities nor their logarithms shed much light on the psychological processes that were responsible for them." (Miller, 2003)
"The information theory approach exemplified by Hick and Hyman viewed the human as a passive information channel, which amounted to a denial that interesting processes go on internally." (Lachman et al., 1974, p.142)
"Information theorists (e.g., Hick, 1952) use a complex formula for combining CRT and error information. It measures information transmitted. We do not present the information-transmission metric as a solution to the speed-accuracy problem. It is based on the conception that humans are passive communication channels, which they are not. A more promising approach is to construct speed-accuracy operating characteristics, by varying people's mental set and deriving empirical functions similar to the idealized one presented in Fig. 5.15. Having done that, one can estimate the theoretical ideal from his data. But this process is time-consuming and, for some purposes; unnecessary." (Lachman et al., 1974, p.161)
All told, the paper shows the potential for a great contribution but would need substantial work to realize it, and so I am compelled to rate it quite low.
References
Edwards, W. (1961). Costs and payoffs are instructions. Psychological Review, 68(4), 275-284.
Reviewer 1 (reviewer)
Expertise
Expert
Recommendation
. . . Between neutral and possibly accept; 3.5
Review
Title: Text Entry Throughput: Towards Unifying Speed and Accuracy in a Single Performance Metric
The paper introduces a new information theoretic measure of text throughput. It reports theory and an experimental study. This is a significant and rigorous paper that makes a good contribution to HCI. It is well presented and easy to read.
One concern is that the work may find an unduly narrow audience (people interested in information theoretic models of text entry). This would be a shame. This concern could be addressed with a substantive review and discussion of the relevant literature on tradeoffs in other aspects of HCI and human performance in general (Smith et al., 2008, Payne and Howes, 2013). Smith for example reports a model of a substantive range of trade-off strategies for typing. Payne and Howes provide a number of examples of adaptive behaviour in HCI. They also discuss reward schemes for running experiments similar to that used in the submission. Additionally, some discussion of the implications of the reported analysis for text entry systems (e.g. Kristensson, 2018) might be useful.
More importantly, I am concerned that the single measure of throughput makes it difficult for the model to account for variation in criteria and difficult to account for individual differences.
Refs.
Smith, M.R., Lewis, R.L., Howes, A., Chu, A. and Green, C. (2008). More than 8,192 ways to skin a cat: Modeling behavior in multidimensional strategy spaces. In B.C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society. Austin, Tx: Cognitive Science Society.
Payne, S.J. & Howes, A. (2013). Adaptive Interaction: A utility maximisation approach to understanding human interaction with technology. Morgan Claypool.
Kristensson, P.O. 2018. Statistical language processing for text entry. In Oulasvirta, A., Kristensson, P.O., Bi, X. and Howes, A. (Eds.), Computational Interaction. Oxford: Oxford University Press, 41-61.
Reviewer 2 (reviewer)
Expertise
Knowledgeable
Recommendation
Possibly Reject: The submission is weak and probably shouldn't be accepted, but there is some chance it should get in; 2.0
Review
This paper intends to provide a technical/methodological contribution to HCI. The authors explain how throughput can be used as a single metric to combine the speed and accuracy of text entry.
The positives of this paper are:
- the methodology is explained well
- the work is embedded well in technical literature
In general, this is a nice write-up. However, I have one large concern. Let me try to explain it as accurately as I can. In the optimistic case, the authors can then hopefully convince me in their rebuttal where my reasoning goes wrong.
In short, my question is: why do we actually need this metric? (beyond that it’s a different metric, which might be a purpose in its own right)
I share the sentiment that neither accuracy nor speed is a sufficient metric on its own for most studies of typing behavior. However, does this new metric really add something new?
Looking at the results on Figure 8 (and the post-hoc tests of throughput described on page 9), the metric mostly seems to differentiate “neutral” typing from “extremely accurate” or “extremely fast”. By contrast, the accuracy and speed metrics were able to differentiate each of the five conditions/levels from each other.
In other words, the new metric can distinguish “average performance” from extreme performance (i.e., extremely fast, extremely accurate), but not a lot more. What is the value of being able to do this? If we compare two interfaces, we could identify if they are one of the extreme cases. Typically, those would then be considered “bad” interfaces. But, more realistically, if both are somewhere in the middle zone, the method will not be able to differentiate them. So, we still can’t identify the “better” interface (because the scores typically tend to not be different statistically — at least in your test). Small differences are a hard argument for saying one interface is better than the other.
Moreover, if we were to use speed and accuracy (without the reformulation) we could also identify extreme cases, AND we would be able to explain *why* they are extreme cases. For example, one interface could be preferred over the other, because it does not meet a desired speed criterion, or it does not meet a desired accuracy criterion. Such an explanation of *why* is not possible with the new metric, as different levels of “middle level” are hard to distinguish.
And yes, making a selection based on speed and/or accuracy instead of one single metric does require some interpretation AND might become subjective. However, a decision of what is "good" can be context dependent, right? Sometimes speed is a more important concern (e.g., quickly sending a really, really important text message while driving a car, or firing off a quick reply to a friend), and sometimes accuracy is more important (e.g., crafting a perfect letter as part of a job application, or sending a careful mail to a professor). Such careful considerations need a measurement of speed and of accuracy, and the decision is made based on those metrics, not on an aggregate, transformed metric.
So: Is there a further need beyond being able to identify a “poor” interface from a “not poor” interface and is this something that we can only do with this new metric? Or could we have already made that decision with the existing metrics?
Or, in other words, from my understanding, the new metric is less good at differentiating different conditions AND we lose information about WHY something is poor or not when decisions are made based on the metric.
Smaller points:
- the number of participants mentioned in the abstract is different from that in the study
- the study design is reported as a “between subjects” study. However, for a typical between-subjects study, I would expect a direct (statistical) analysis of differences between two groups. This is not the case here. Moreover, it seems like it is hard to do, because the two groups had different levels of the "cognitive sets" factor.
My recommendation would be to report the study as two studies (on two interfaces), of which each study had a within-subjects manipulation (cognitive sets / instructions / criterion). The two studies differ in the levels for the within-subjects manipulation.
- you can’t change it anymore, but it is surprising to me that participants that work 1.5 hours get the same payment as participants that work for 1 hour (both 15 dollars). Isn't that unfair?
- with the appendix as part of the main text, the paper would be 11 pages. Not sure what CHI guidelines are on this, but probably this is too much?
- the font of the paper looks a lot smaller than my other papers… Did the authors tweak this?
(I know that there are different templates, but this seems wrong?)
Thank you, reviewers! We appreciate R1’s comment that our work is "a significant and rigorous paper that makes a good contribution to HCI." Indeed, we spent several years on the theoretical challenge, with the goal of contributing a useful new text entry performance metric. We were not attempting to characterize the speed-accuracy tradeoff (SAT) or illuminate human mental or physical processes. It seems our objective might have been unclear; we can and will clarify it in our revision.
A: THE GOAL OF A THROUGHPUT PERFORMANCE METRIC (R2, 2AC)
A1. Does this approach (using a single metric) provide us with more information about the SAT than previous approaches? [R2]
We emphasize that the goal of our work is *not* to measure or model the SAT (as other researchers have done). Our goal is to devise, theoretically and empirically, a robust *text input performance measure* for comparing input methods in the presence of speed vs. error biases that arise in text entry studies. Some amount of bias cannot be avoided even with experimental control. What we present is a more robust metric than WPM or even AdjWPM.
A2. WHY USE THROUGHPUT? [R2, 2AC]
The 2AC suggested that experimenters should compare different input methods using only completion time while "maintaining a 97% accuracy." Indeed, it'd be great if every participant in text entry studies performed with constant accuracy, but this is not feasible regardless of how text entry studies are run. We're not sidestepping an issue; rather, we're addressing a pragmatic one: two metrics (speed and accuracy), which are at odds, are currently used to characterize text entry performance. We provide an efficiency metric that is less affected by humans' SAT biases, thus better demonstrating the efficiency of a text entry method. Throughput does not conflict with speed/accuracy, but gives a holistic picture of the information transmission rate.
R2 questioned the usefulness of our metric. In a real text entry study, one usually compares different text entry methods. Throughput gives a way of equitably comparing them, even across studies. Please see Table 1, which motivates the problem. The metric is not for distinguishing between extreme and neutral conditions. We compared these conditions under the premise that for a person with an input method, throughput should be stable even if speed and errors vary. Vitally, our goal is to provide a performance metric that is less affected by the SAT because it remains stable. Stability shows that the metric captures efficiency, rather than overly favoring speed or accuracy.
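As an illustration of the "information transmission rate" framing, the following is a minimal sketch of how a throughput-style figure could be computed from a character confusion matrix and an entry rate. It is a simplification for illustration only, not the paper's actual Equation 13, and the numbers are invented.

    import numpy as np

    def mutual_information_bits(joint):
        # joint[i, j]: joint probability of presenting character i and transcribing character j
        joint = joint / joint.sum()
        px = joint.sum(axis=1, keepdims=True)  # marginal over presented characters
        py = joint.sum(axis=0, keepdims=True)  # marginal over transcribed characters
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = joint * np.log2(joint / (px * py))
        return float(np.nansum(terms))         # I(X; Y) in bits per character

    # Toy two-character alphabet: 90% of characters transcribed correctly.
    joint = np.array([[0.45, 0.05],
                      [0.05, 0.45]])
    bits_per_char = mutual_information_bits(joint)  # about 0.53 bits per character
    throughput_bps = bits_per_char * 5.0            # assuming 5 characters entered per second
    print(bits_per_char, throughput_bps)

On this reading, the claim of the rebuttal is that a figure like throughput_bps stays comparatively stable as a typist trades speed against accuracy, because faster entry raises the characters-per-second factor while the extra errors lower the bits-per-character factor, whereas WPM and error rate each swing on their own.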
B: EXPERIMENT DESIGN AND OUTCOME (2AC)
B1. How can you show that participants truly were encouraged to use different cognitive sets? And do the authors believe that participants were able to understand the possible tradeoffs? [2AC]
If the phrase "cognitive set" was confusing (we took it from Fitts (1966), who used the phrase in the same way we mean), we can change it to "speed or error bias." The actual amount of bias toward speed or accuracy is shown in Figs. 8-9, where one can see that across conditions from extremely accurate (EA) to extremely fast (EF), participants became faster and less accurate. The same patterns hold for pointing tasks with Fitts' law, whose speed-accuracy correction is in Crossman (1957) and Welford (1968). (See also Zhai IJHCS 2004.) Our work is like that of Crossman (1957), but for text entry, not pointing.
We ensured all participants understood EA-EF before the experiment, having them demonstrate their typing strategy in each condition. We agree our payoff scheme seemed complicated in the paper, but to participants, it was actually straightforward: from EA to EF, they got a smaller bonus for each phrase, but they also got a smaller error penalty, so they could enter more phrases quickly and worry less about errors. Again, Figs. 8-9 confirm that participants increased speed and decreased accuracy across EA-EF conditions, which was all that was required for our study’s manipulation, so that we could show the stability of throughput.
C: RELEVANCE AND AUDIENCE [R1]
R1 was concerned that the audience for this work would be narrow. But the audience isn't only "people interested in information-theoretic models of text entry," but anyone evaluating a text entry method. Our metric would be as broadly useful as WPM or error rates. Thus, the audience is any text entry researcher or evaluator. There were 209 papers reporting text entry performance outcomes in CHI ‘16-‘18. Any of them might benefit from an additional metric of input efficiency, esp. if they had an SAT present in their results.
D: INFORMATION THEORY (2AC)
We agree with the 2AC that Shannon's model is not illuminating for human mental or physical processes. But in our work, we are not using Shannon's model in that way. We are using Shannon's model for what it does—characterize the information communicated through a noisy channel. In our case, the channel is a text entry method, and we are characterizing the performance of that channel. This is a *good* use of Shannon’s model, unlike using it to elucidate cognitive phenomena, which we emphatically are *not* doing. We will revise the paper wherever it might have given the impression of advocating a "dark age" approach. We will note the decline of Shannon's IT in psychology (even Shannon's own comment about the "bandwagon" effect of overusing IT). But text input is literally a communication channel of the kind Shannon’s theory is made for.
MISC:
We thank the reviewers, and esp. the 2AC, for the suggested related work, which we will include in our revision.
Finally, throughput does not substitute for speed and accuracy. As we say in the Conclusion: "At the same time, speed and accuracy should also be reported..."
We sincerely hope this rebuttal helps reviewers appreciate the purpose and value of our work. We are confident it would make a strong contribution to the text entry community. Thank you!
1AC:
Recommendation
Possibly Accept: I would argue for accepting this paper; 4.0
Rebuttal response
The rebuttal has been received well by some reviewers but remains problematic for R3 who still does not believe this paper represents a positive contribution to the HCI literature. Despite this, other reviewers all feel strongly positive about the paper and believe it is an interesting piece of work.
After extended discussion it was decided that this paper should be accepted. However authors are encouraged to consider the points of R3 (and other reviewers) when preparing their camera ready version. Notably:
a) The paper and rebuttal state that this is a method for assessing solely the interface as a channel; however, the user remains an important and variable part of this channel. How do the authors address this?
b) The paper does not do a good job of highlighting that this is an additional metric and appears for large portions of the paper to imply that it should be used to replace SA measures.
This paper was debated in discussions and we therefore look forward to this work being discussed further in public in the future!
2AC:
Recommendation
. . . Between reject and possibly reject; 1.5
Rebuttal response
I greatly appreciate the positive tone of the rebuttal even though it did not appear to me to successfully counter any of the criticisms posed by the 2AC review.
The rebuttal states that “it'd be great if every participant in text entry studies performed with constant accuracy, but this is not feasible regardless of how text entry studies are run.” This is patently untrue. There is no basis for this statement. This is an unhelpful statement to put forward in an HCI research community. Furthermore, if a text entry study *cannot* be run at 97% accuracy, it is evidence that the text entry technique itself is fundamentally flawed.
There is no confusion about the use of the term “cognitive set”. It is a fine term. All of the criticisms in the 2AC review still apply with the rephrasing of the term as "speed or error bias." The rebuttal restates that participants demonstrated an understanding of the payoffs and does not address the criticism that, if this were true, in the “EF” condition participants should have issued random keystrokes but they did not. It is impossible that participants truly understood the payoff and sought to maximize it. It is concerning to see the rebuttal suggest that this was not required.
Regarding the relevance of the work, it would be a negative contribution to the field, and make text entry studies meaningless and unusable, if such studies were to somehow take the advice offered in this paper to report “efficiency” rather than speed and accuracy. Yes, it is true, as stated in the rebuttal, that the very last sentence of the paper undermines the goal of the paper stated in the abstract: “This work allows researchers to characterize text entry performance with a single unified measure of input efficiency.” Yes, in the last sentence of the paper, it is stated that speed and accuracy should also be reported. If this next sentence here cannot appear in the abstract, the paper should be rejected: “Although speed and accuracy should also be reported, as they provide essential practical information, this work allows researchers to characterize text entry performance with a single unified measure of input efficiency.”
I appreciate the rebuttal pointing out that “We are using Shannon's model for what it does—characterize the information communicated through a noisy channel. In our case, the channel is a text entry method, and we are characterizing the performance of that channel.” But this is not at all what is being characterized. The “channel” is everything that occurs between the presentation of the stimulus and a keystroke being entered, and there is a human in that “channel”. A text entry technique does not exist in a vacuum. It is only an input method if there is a human to activate it. And so the “noisy channel” characterization includes the human, and all of the criticisms made in the 2AC review of the paper’s use of information theory to characterize human performance stand.
Reviewer 5 (2AC)
Expertise
Knowledgeable
Recommendation
Possibly Accept: I would argue for accepting this paper; 4.0
Review
This paper addresses an important question: how do we best evaluate different text input methods such that our evaluations are not biased by the speed-accuracy tradeoff?
I found this paper really exciting to read. The paper proposes an evaluation metric by elegantly adapting the work of Shannon in information theory. The paper proceeds to explain step-by-step how to apply the new metric. Finally, it demonstrates that the metric can perform well by differentiating between two text input methods that we can, from experience, easily differentiate ourselves, too: a laptop keyboard and a phone keyboard with virtual keys.
I will keep this review short - the other reviewers provided excellent comments/suggestions, and the rebuttal responds well to these. Based on the paper, the reviewer comments, and the rebuttal, I would recommend accepting this paper.
Reviewer 1
Recommendation
. . . Between neutral and possibly accept; 3.5
Rebuttal response
(blank)
Reviewer 2
Recommendation
Possibly Accept: I would argue for accepting this paper; 4.0
Rebuttal response
Post-rebuttal response:
I want to thank the authors for their carefully crafted rebuttal. It is now clear to me what the intended contribution of this paper is and I believe that it has been met. I have therefore increased my score from a 2.0 to a 4.0.
Before I thought that information was lost because the metric only allows distinguishing "average" (or balanced) conditions from extreme conditions. However, I now understand that this is by design, that the authors want to introduce this metric for that specific purpose and that this has added value.
Let me add that I really want to encourage the authors to dedicate sufficient space in their final paper to make the argument about text bias. Some more examples or detailed elaboration / context on the downside of working with such biases in text entry tasks can really help a wider audience to appreciate the contribution that you make in this paper. For example, although you refer to Table 1 as motivating the problem, one interpretation of Table 1 is "so, what?". Yes, the table shows that different methods lead to different conclusions (and yes, this can be important), but a little more narrative is needed here I believe. You need to explain what the consequences are of such decisions. Now it is just a mathematical example without any context.
Anyway, the work is solid, but take our (or at least my) feedback from a positive perspective (I'm trying to help :-)) and use it to articulate your argument even better in the paper. I look forward to (hopefully) reading the final version of this paper soon!