DrustZ's Paper Writing Class [Bonus Episode: Rebuttal]

Hehe, bet you didn't see this coming: a bonus episode this week!

Following the advice from the paper writing class, you wrote and submitted your masterpiece, hoping for a good outcome. Then the scores come in, and one look tells you that publication is hanging by a thread! In the spirit of doing right by everyone, today's extra tutoring session covers how to write a paper rebuttal, so you can turn the tide, pull out of danger, and snatch victory from the jaws of defeat!

One-sentence summary

The three stages of your attitude toward the reviewers during a rebuttal: the reviewers are heartless jerks - the reviewers have a tiny bit of a point - the reviewers are my saviors.

Table of Contents

  • What a rebuttal is (and what it isn't)

  • Mental health matters! The three stages of a rebuttal

    • Very angry!

    • Calm and collected!

    • Grateful to tears!

  • How to write a rebuttal for a high-scoring paper

What a rebuttal is (and what it isn't)

A rebuttal is the step, after the reviewers have scored your paper, that gives the authors a chance to address the issues raised. Reviewers write comments and assign scores based on the paper; the authors then need to use the rebuttal to clear up the problems those comments point out. In human-computer interaction, papers are scored on a 5-point scale: 3.5-5 means accept, below 3 means reject. The rebuttal is the authors' opportunity to argue their case; when the scores are on the low side, a good rebuttal can win the reviewers back and push the scores up.

Key point: the rebuttal is your chance to clear up the reviewers' confusion, answer their questions, and persuade them to raise their scores. Its content should therefore be built on facts rather than subjective commentary. The rebuttal is not the place for venting, wailing, or self-promotion. You could certainly fill it with "this paper is earth-shattering, it founds an entirely new framework, and you blind reviewers simply have no taste", but the guaranteed outcome is rejection.

Let me share a painful lesson of my own. My first paper implemented an idea my advisor thought was quite good, but the first submission scored only 2.5, 2, 3.5, 1.5, 2.5 (four reviewers against), and some of the comments were worded rather arrogantly. My advisor blew up after reading them (it was his idea, after all, haha), said "these sons of ***es, I'm writing the rebuttal this time!", and soon I received a fiercely worded rebuttal: no profanity, but it cast doubt on every reviewer's scientific literacy and intelligence. Submitting it felt great, but as you can imagine, the scores didn't budge, and the primary reviewer (who knew who we were) replied with equally harsh feedback, saying that a mediocre paper is one thing, but having no manners at all is another, and that doing research starts with learning to respect people (hhh).

So! Keep the rebuttal as objective and calm as you can. Even if you want to slip in some emotion, make it positive energy! After all, only a reviewer who finishes reading it in a good mood will raise your score, right? (doge)

Mental health matters! The three stages of a rebuttal

Ah, my dear friend, if the reviews landed in your inbox today, don't be sad, don't panic! Remember, mental health comes first! Life is beautiful! A little paper is nothing (forced smile)!

The moment the email arrives, all you need to do is quickly scan the scores. If you are satisfied: congratulations, go ahead and read the comments in detail. Otherwise, close the mailbox immediately and take a deep breath!

Then follow these three stages:

  • Very angry!

This stage lasts about one or two days. On first reading the reviews and scores, you will feel anger, disappointment, and misery, with intermittent urges to quit the PhD or punch someone. All of this is normal. During this period, at the very least skim the comments once in full so that you roughly know which points cost you the low scores. Then close the mailbox quickly and take care of your other work first. In the evening, set aside some time for a drink, some boxing, or a rant session with friends about how the reviewers are all morons: the harder you curse, the faster the anger fades! And don't forget to share the reviews with your collaborators (your advisor, for instance) and spread the negativity around!

  • Calm and collected!

By the third day, you have probably pulled yourself back together. Now it is time to sit down and read the reviews closely. Even though many comments will strike you as utterly idiotic, hold back the rage and read them through rationally, word by word. Once you can get through them without jumping up and swearing (don't laugh, I do this all the time), you are ready for the next stage.

This stage has two goals: 1. figure out what each reviewer said, and which points matter and which don't; 2. work out how to address the issues they raised.

For goal 1, my usual process is:

  1. Read the comments closely once, highlighting in yellow every question, criticism, and requested improvement, and in green every place where the paper is praised.

  2. Once all the good and bad points are highlighted, merge similar points and consolidate them into a few major issues. For example, if three reviewers all question the paper's novelty and contribution, group those together; if two say the data analysis has holes, group those as well. There will likely be many minor points (raised by only one person, not particularly important); put all of these into an "other" bucket.

  3. Sort the reviewers into camps based on their scores and comments: the close allies, the quiet observers, and the lost causes. Close allies give high scores and fair, constructive comments. Some reviewers give a middling score but comment objectively, leaving you with the sense that the score would go up if their questions were answered clearly. The quiet observers are the ones whose reviews are short and unsubstantial, or whose scores are ambiguous and come with no actionable suggestions. The lost causes are exactly what the name says: a low score plus a pile of opinions, some reasonable and some not, and not even a Nobel-level rebuttal would change their minds.

  4. Among these camps, the most important are the close allies, plus any quiet observers who can be won over into close allies. Their comments matter most: make these people happy first and let their happiness pull the others along. That is the core principle of the rebuttal. So mark every doubt, question, and suggestion they raise as important, and answer those first.

  5. Another sign that an issue is important is that it shows up in the 1AC's (primary reviewer's) meta-review. The 1AC usually synthesizes all the reviews into a few major questions, and the rebuttal needs to answer each of them one by one.

  6. After the major issues are dealt with, work out how to argue the remaining ones, and answer the small, trivial points last. At the outline stage, every single issue should be covered.

For goal 2, it depends on the type of issue:

  1. Abstract issues, for example when someone feels the contribution is unclear relative to prior work, the novelty is too thin, or the research question is not important enough: these are the most common. The way to respond is still to argue from facts. If the contribution seems unclear, list it as 1, 2, 3 and cite the paper's results as evidence. If the problem is deemed unimportant, cite related work to show that it matters. For these abstract complaints, facts speak louder than rhetoric.

  2. Mistakes and doubts about procedure, such as a questionable experimental step or a flawed data analysis method: if it can be remedied, fix it quickly and report the new results in the rebuttal. If it can no longer be changed (for instance, some data was never recorded, or a variable was left out of the experimental design), admit the mistake and look for ways to limit its impact. For this category, both whether it can be fixed and how to fix it are the most straightforward.

  3. Then there are issues that are pure products of reviewer cluelessness: questioning your conclusions without having understood the paper, taking random shots at some result, or rambling about something completely off the mark to question the work's generalizability. Sometimes these issues even look very important to the reviewers themselves (after all, the misunderstanding is real to them). Even though they may not require any extra work, the rebuttal still has to explain them away clearly.

  4. Other small things, such as grammar, a forgotten citation, or figure formatting: all easy fixes; take care of them and mention in the rebuttal that you did.

  • Grateful to tears!

Once the outline is done, it is time to actually write the rebuttal! First of all, write in a humble, discussion-oriented tone, with respect for every reviewer as the baseline. As I said before: only a rebuttal that leaves the reader happy stands a chance of gaining points! So channel the attitude of a schoolkid caught at the internet cafe by the homeroom teacher, frantically apologizing and admitting every fault (did I just give something away about myself?)!

Use plenty of courteous phrasing: "We sincerely hope", "we deeply appreciate", "we respectfully disagree". The worst thing you can do is point at someone and say they are flat-out wrong (even if they truly are clueless)! If you want to say "R1 is totally wrong", write "We understand R1's concern, but ~" instead. If you want to say "Does 2AC even understand the paper?", write "Our presentation was not clear and caused 2AC's confusion. We will revise the paper to make it crystal clear that ~". Learn to say nice things!

When responding to a suggestion, or to a legitimate error a reviewer points out, always acknowledge the shortcoming first, then say how much you appreciate R1 for raising it! If you want to lay the flattery on a little thicker, you can write "we are grateful that R1 pointed out xxx, which certainly makes the statement scientifically stronger".

Also, at the end (or the beginning), thank all the reviewers! Be especially grateful to the close allies, say R1 and R3, and you can even quote their own words. For example, if R1 praised your work as a seminal work, lift the quote directly: "we are encouraged by R1's comment that 'this is a seminal work in the xxx field'". After all, using someone's own praise to compliment yourself is devastatingly effective!

Finally, when writing it up, lay out the consolidated issues point by point and answer each one. That makes it easier for the reviewers to follow (remember, by the time they read your rebuttal it has been a month since they reviewed the paper, and they may have forgotten what it even said...).

How to write a rebuttal for a high-scoring paper

Who asked this?! Was it you? Step forward! Your scores are already high, what are you showing off for? Go back to whatever you were doing! (just kidding)

For a paper that was received well, the goal of the rebuttal is to hold your ground! Keep in mind that some papers in the safe zone still get cut! So: 1. definitely write a rebuttal; 2. write it to the same standard as for a low-scoring paper: summarize the issues, answer them completely, and don't forget to compliment the reviewers!


Finally! Here, unveiled at last, is the rebuttal that had me most on edge! Read on if you are interested. This paper [1] started with scores of 2.5, 1.5, 3.5 and 2, which was basically hopeless. But we clung tightly to the 3.5 reviewer, and we noticed that the 2.5 and 2 reviewers did not really understand the work and had been somewhat swept along by the 1.5 reviewer (whose comments carried a strong personal bias)... So, after deploying all the tactics in this post, the rebuttal got the paper into the discussion phase, an additional reviewer was brought in, and the final scores came out to 4, 1.5 (that 1.5 never budged), 4, 3.5, 4! Knowing which reviewers can be won over really matters!

Here are the original reviews:

Reviewer 4 (AC)
Expertise
Knowledgeable
Recommendation
. . . Between possibly reject and neutral; 2.5
1AC: The Meta-Review
From reviews it seems that this paper is well written and that by addressing the SAT, is investigating an important area of research. However there is disagreement about whether this paper manages to provide useful insight into how to manage the tradeoff.

Two reviewers were unsure that this approach was a suitable way to address the SAT issue [R2, R3], by folding the matrix into a single metric there is concern that information is lost.  

R3 wanted more information about whether the participant manipulation was successful, is there evidence to suggest the different cognitive sets were observed in the experiment? And can you be sure that participants fully understood the possible tradeoffs available to them? 

R1 was unsure that this approach would be suitable for modelling individual differences in participants.

The reviewers provided some very comprehensive reviews with clear questions that need to be addressed by the authors. 

Rebuttal Opportunity
The authors should focus on the following points 
1. Does this approach (using a single metric) provide us with more information about the SAT than previous approaches? [R2]
2. How can you show that the participants truly were encouraged to use different cognitive sets? And do the authors believe that participants were able to understand the possible tradeoffs? [R3]
3. What other areas of HCI (beyond text entry) could this be applicable to? [R1]

Reviewer 3 (2AC)
Expertise
Expert
Recommendation
. . . Between reject and possibly reject; 1.5
Review
The paper has the potential of making a good contribution to the field of HCI, but not the contribution that it currently aims to make. It would be great to help HCI researchers and practitioners to better understand the speed-accuracy tradeoff (SAT). A lack of awareness of this tradeoff shows up a lot, especially in papers that end up getting rejected. It would be a great contribution to show HCI researchers how to design an experiment with a speed-accuracy-tradeoff matrix, an experimental design that motivates high speed while maintaining a 97% accuracy (as recommended, for example, by Pachella, 1974, pp.59-60, cited in [5]); maintaining such an error rate across the use of two competing interaction techniques, for example, helps the two techniques to be directly compared by means of their task-completion times.

This review will focus on the potential positive contribution that such a paper could make. It is important, however, to note that the stated goals of the paper, to conclude that speed and accuracy can be folded together into a single performance measure, even if this conclusion could be supported by the results, would make a negative contribution to the field. It is theoretically misguided to suggest that the two can be folded together, or to pursue this as a goal. The goal of the paper should not be to tell HCI researchers and practitioners that you can make things easier on yourself by using some new technique, but to educate HCI researcher and practitioners as to the problem of SAT, so that they can design better experiments. As stated in ([5], p.4) "the solution to this problem is to bring decision criterion under experimenter control." The goal is not to try to find some magical way to sidestep the problem.

This submission describes a valiant effort to bring the criterion under experimenter control, and that is the potential contribution. However, it appears as if the experiment could have done a better job in this effort. 

Experimental Design to Motivate Five Cognitive Sets

It is not clear that the reported study truly brought the participants' decision criterion under experimenter control. In other words, the paper aims to show the result of getting participants to use different "cognitive sets", but the paper does not convincingly demonstrate that the experiment successfully did so. In other words, the submission needs to be more clear that participants truly used the "cognitive sets" that the paper presumes they were using. There are a number of conditions that, if it could be established that they were met, could help to establish this. The following need to be established:

1. It needs to be established that the device truly provided feedback (the green, black, or red responses, for example) that was accurately tuned to nudge participants towards one of the five cognitive sets. These feedbacks should be directly controlled by the payoff matrix; that is, the cost and benefit payments made for speed versus accuracy. These matrices are very difficult to set correctly. It needs to be established that participants could truly use the feedback to arrive one of the five cognitive sets. 

2. It needs to be established that the feedback provided, and the payoff matrix, were sufficiently sensitive to deviations from the intended performance such that participants could perceive there was good reason to move towards the correct strategy; for example, when the penalty for an error was something to care about. For example, nobody will care about 0.05 cents across 100 trials. People will start to care about 5 cents across 100 trials.

3. It needs to be established that participants successfully used the feedback to truly arrive at one of the five cognitive sets. One important indicator of this would be whether the feedback provided to participants consistently improved during the course of a block. This is a critical piece of data if we are to believe that the task, as designed, was truly providing feedback that nudged participants to the "cognitive sets" that the paper claims the participants were using.

Payoff Matrix

Sections 6.3 and 6.5 suggest that it is unlikely these three conditions were met. There are several reasons for this. The following issues point to potential flaws in the experimental design that make it difficult or impossible to interpret the results:

1. The instructions provided to participants for each of the five SA conditions were not consistent with the payoffs that were used for each of the five SA conditions. For example, the instructions for "EF" (on Table 2) to "just ignore errors" should result in a pure guessing strategy; that is, random keystrokes. If such performance was not observed, and was not optimally rewarded, this would be a flaw in the experimental design and outcome. Table 3 shows that this was not the reward that was applied, and that errors were not ignored by the experimenters even though this was exactly the instructions provided to participants. This makes it difficult to conclude what "cognitive set" participants truly arrived at.

2. The payoff matrix cannot be explained or understood in any kind of a direct manner. Compare and contrast the payoff matrix described in Section 6.4 with those described in, for example, Pachella and Pew (1968), Swanson and Briggs (1969), and Lyons and Briggs (1971), all of which are cited as examples of how to set a payoff matrix in [5]. The simplicity of the matrices in these three papers is as striking as the complexity of the matrix in this submission. The matrix in this submission is very complicated, and fills all of Section 6.4. Equation 13 is not directly interpretable by participants such that they could understand how to emphasize speed versus accuracy. And yet, for participants to understand the results of their performance choices, they need to understand the exact payout method. This is exactly what was done by Pachella and Pew (1968), Swanson and Briggs (1969), and Lyons and Briggs (1971). The payoff matrix must be interpretable by the participants. Some researchers have argued that the payoff matrix is almost the only thing that participants need to know to do a task with a particular cognitive set (Edwards, 1961).

3. The feedback methodology used in the experiment would have resulted in different frequencies of feedback provided to participants across different conditions. This is an important experimental confound that provides different opportunities to adjust behavior across conditions. Feedback appears to have been provided, for example, roughly 7 times per minute for smartphone plain EA, and 15 times a minute in laptop EF.

4. The eventual grouping of EA with A, and EF with F, suggests that the experimental design did not succeed in motivating participants to arrive at five unique cognitive sets.

The paper states that "all of our participants understood the bonus scheme" (p.10) but does not offer any strong evidence of this.

Results

The results section provides some evidence that the experiment successfully brought participants to a range of different SATs, and this is potentially useful to the HCI community. However, rather than using this as a basis for suggesting that experimenters should combine speed and accuracy into a single measure, there are much more important lessons to be drawn and shared:

1. The error rates of the EA and A conditions for the laptop keyboard are very low because each is simply on a different point of the horizontal line that corresponds to "extreme accuracy emphasis" in Pachella (1974, Figure 4, cited in [5]). Once performance hits this horizontal line, as is explained in Pachella, any differences in RTs are meaningless. The submission should use this data along with a correct interpretation of SATs, and how they affect human performance, to point out that any differences to be seen in the EA and A conditions for the laptop keyboard, including in Speed and AdjWPM (but not Throughput because this measure should not be reported) are thus meaningless and uninterpretable. The differences across these two conditions just represent the difference between waiting until you are sure you are providing a correct response, and waiting just a little longer to make really sure you are providing a correct response. This will be an arbitrary difference.

2. The only useful comparisons that can be made between the laptop keyboard and the smartphone keyboards would be the ones in which the performances result in roughly 95% accuracy. This, again, can be explained based on the theory summarized nicely by Pachella (1974, cited in [5]) Figure 4. Fortunately, an N condition for each keyboard produced roughly equivalent error rates, and so these conditions could potentially be used for a comparison. It is somewhat impressive to see 93.7% accuracy for the laptop and 93.9% for the plain smartphone; these appear to result in the only WPMs that can be directly compared without concern of an SAT. This would be a great result for the HCI researchers and practitioners to see.

The results suggest the potential for a contribution but it would be more impressive if the experiment provided a more clear and deliberate example of how to get your participants to arrive at common error rates across two devices. The experimental design, as discussed, in its current form, does not provide a clear or exemplary path to do so.

Information Theory

The extensive discussion of information theory throughout the paper is scientifically regressive. This kind of language and this way of thinking does not advance the field. The paper should not advance information theoretic interpretations without at least also acknowledging that this characterization of human behavior embraces a regressive view of the human, consistent with the "dark ages" of behaviorism (Meyer et al., 1988), and that this characterization has been broadly discredited in the cognitive science literature. For example:

"During those years [the 1950s] I personally became frustrated in my attempts to apply Claude Shannon’s theory of information to psychology. After some initial success I was unable to extend it beyond Shannon’s own analysis of letter sequences in written texts. The Markov processes on which Shannon’s analysis of language was based had the virtue of being compatible with the stimulus–response analysis favored by behaviorists. But information measurement is based on probabilities and increasingly the probabilities seemed more interesting that their logarithmic values, and neither the probabilities nor their logarithms shed much light on the psychological processes that were responsible for them." (Miller, 2003)

"The information theory approach exemplified by Hick and Hyman viewed the human as a passive information channel, which amounted to a denial that interesting processes go on internally." (Lachman et al., 1974, p.142)

"Information theorists (e.g., Hick, 1952) use a complex formula for combining CRT and error information. It measures information transmitted. We do not present the information-transmission metric as a solution to the speed-accuracy problem. It is based on the conception that humans are passive communication channels, which they are not. A more promising approach is to construct speed-accuracy operating characteristics, by varying people's mental set and deriving empirical functions similar to the idealized one presented in Fig. 5.15. Having done that, one can estimate the theoretical ideal from his data. But this process is time-consuming and, for some purposes; unnecessary." (Lachman et al., 1974, p.161)

All told, the paper shows the potential of a great contribution but would need substantial work to do so, and so I am compelled to rate it quite low.

References

Edwards, W. (1961). Costs and payoffs are instructions. Psychological Review, 68(4), 275-284.
Reviewer 1 (reviewer)
Expertise
Expert
Recommendation
. . . Between neutral and possibly accept; 3.5
Review
Title: Text Entry Throughput: Towards Unifying Speed and Accuracy in a Single Performance Metric

The paper introduces a new information theoretic measure of text throughput. It reports theory and an experimental study. This is a significant and rigorous paper that makes a good contribution to HCI. It is well presented and easy to read. 

One concern is that the work may find an unduly narrow audience (people interested in information theoretic models of text entry). This would be a shame. This concern could be addressed with a substantive review and discussion of the relevant literature on tradeoffs in other aspects of HCI and human performance in general (Smith et al., 2008, Payne and Howes, 2013). Smith for example reports a model of a substantive range of trade-off strategies for typing. Payne and Howes provide a number of examples of adaptive behaviour in HCI. They also discuss reward schemes for running experiments similar to that used in the submission. Additionally, some discussion of the implications of the reported analysis for text entry systems (e.g. Kristensson, 2018) might be useful. 

More importantly, I am concerned that the single measure of throughput makes it difficult for the model to account for variation in criteria and difficult to account for individual differences. 

Refs.

Smith, M.R., Lewis, R.L., Howes, A., Chu, A. and Green, C. (2008). More than 8,192 ways to skin a cat: Modeling behavior in multidimensional strategy spaces. In B.C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society. Austin, Tx: Cognitive Science Society.

Payne, S.J. & Howes, A. (2013). Adaptive Interaction: A utility maximisation approach to understanding human interaction with technology. Morgan Claypool.

Kristensson, P.O. 2018. Statistical language processing for text entry. In Oulasvirta, A., Kristensson, P.O., Bi, X. and Howes, A. (Eds.), Computational Interaction. Oxford: Oxford University Press, 41-61.
Reviewer 2 (reviewer)
Expertise
Knowledgeable
Recommendation
Possibly Reject: The submission is weak and probably shouldn't be accepted, but there is some chance it should get in; 2.0
Review
This paper intends to provide a technical/methodological contribution to HCI. The authors explain how throughput can be used as a single metric to combine the speed and accuracy of text entry. 

The positives of this paper are:
- the methodology is explained well
- the work is embedded well in technical literature 

In general, this is a nice write-up. However, I have one large concern. Let me try to explain it as accurate as I can. In the optimistic case, the authors can then hopefully convince me in their rebuttal where my reasoning goes wrong.
In short, my question is: why do we actually need this metric? (beyond that it’s a different metric, which might be a purpose in its own right)

I share the sentiment that neither accuracy nor speed is a sufficient metric on its own for most studies of typing behavior. However, does this new metric really add something new?
Looking at the results on Figure 8 (and the post-hoc tests of throughput described on page 9), the metric mostly seems to differentiate “neutral” typing from “extremely accurate” or “extremely fast”. By contrast, the accuracy and speed metrics were able to differentiate each of the five conditions/levels from each other.

In other words, the new metric can distinguish “average performance” from extreme performance (i.e., extremely fast, extremely accurate), but not a lot more. What is the value of being able to do this? If we compare two interfaces, we could identify if they are one of the extreme cases. Typically, those would then be considered “bad” interfaces. But, more realistically, if both are somewhere in the middle zone, the method will not be able to differentiate them. So, we still can’t identify the “better” interface (because the scores typically tend to not be different statistically — at least in your test). Small differences are a hard argument for saying one interface is better than the other.

Moreover, if we were to use speed and accuracy (without the reformulation) we could also identify extreme cases, AND we would be able to explain *why* they are extreme cases. For example, one interface could be preferred over the other, because it does not meet a desired speed criterion, or it does not meet a desired accuracy criterion. Such an explanation of *why* is not possible with the new metric, as different levels of “middle level” are hard to distinguish.

And yes, making a selection based on speed and/or accuracy instead of one single metric does require some interpretation AND might become subjective. However, a decision of what is “good” can be context dependent, right? Sometimes speed is a more important concern (i.e., quickly getting a really, really important text message sent while driving a car; sending a quick reply to a friend per text), 
sometimes accuracy is more important (i.e., crafting a perfect letter as part of a job application, sending a careful mail to a professor).

Such careful considerations need a measurement of speed and of accuracy, and the decision is made based on those metrics, not on an aggregate, transformed metric.

So: Is there a further need beyond being able to identify a “poor” interface from a “not poor” interface and is this something that we can only do with this new metric? Or could we have already made that decision with the existing metrics?
Or, in other words, from my understanding, the new metric is less good at differentiating different conditions AND we lose information about WHY something is poor or not when decisions are made based on the metric.


Smaller points:
- the number of participants mentioned in the abstract is different from that in the study
- the study design is reported as a “between subjects” study. However, for a typical between-subjects study, I would expect a direct (statistical) analysis of differences between two groups. This is not the case here. Moreover, it seems like it is hard to do, because the two groups had different levels of the "cognitive sets" factor.
My recommendation would be to report the study as two studies (on two interfaces), of which each study had a within-subjects manipulation (cognitive sets / instructions / criterion). The two studies differ in the levels for the within-subjects manipulation.

- you can’t change it anymore, but it is surprising to me that participants that work 1.5 hours get the same payment as participants that work for 1 hour (both 15 dollars). Isn't that unfair?

- with the appendix as part of the main text, the paper would be 11 pages. Not sure what CHI guidelines are on this, but probably this is too much?

- the font of the paper looks a lot smaller than my other papers… Did the authors tweak this?
(I know that there are different templates, but this seems wrong?)

And here is the glorious rebuttal!

Thank you, reviewers! We appreciate R1’s comment that our work is "a significant and rigorous paper that makes a good contribution to HCI." Indeed, we spent several years on the theoretical challenge, with the goal of contributing a useful new text entry performance metric. We were not attempting to characterize the speed-accuracy tradeoff (SAT) or illuminate human mental or physical processes. It seems our objective might have been unclear; we can and will clarify it in our revision.

A: THE GOAL OF A THROUGHPUT PERFORMANCE METRIC (R2, 2AC)

A1. Does this approach (using a single metric) provide us with more information about the SAT than previous approaches? [R2]

We emphasize that the goal of our work is *not* to measure or model the SAT (as other researchers have done). Our goal is to devise, theoretically and empirically, a robust *text input performance measure* for comparing input methods in the presence of speed vs. error biases that arise in text entry studies. Some amount of bias cannot be avoided even with experimental control. What we present is a more robust metric than WPM or even AdjWPM.

A2. WHY USE THROUGHPUT? [R2, 2AC]

The 2AC suggested that experimenters should compare different input methods using only completion time by "maintaining an error rate (at 97%)." Indeed, it'd be great if every participant in text entry studies performed with constant accuracy, but this is not feasible regardless of how text entry studies are run. We're not sidestepping an issue—rather, we're addressing a pragmatic issue—that two metrics (speed and accuracy), which are at odds, are currently used to characterize text entry performance—but we provide an efficiency metric that is less affected by humans’ SAT biases, thus better demonstrating the efficiency of a text entry method. Throughput does not conflict with speed/accuracy, but gives a holistic picture of the information transmission rate.

R2 questioned the usefulness of our metric. In a real text entry study, one usually compares different text entry methods. Throughput gives a way of equitably comparing them, even across studies. Please see Table 1, which motivates the problem. The metric is not for distinguishing between extreme and neutral conditions. We compared these conditions under the premise that for a person with an input method, throughput should be stable even if speed and errors vary. Vitally, our goal is to provide a performance metric that is less affected by the SAT because it remains stable. Stability shows that the metric captures efficiency, rather than overly favoring speed or accuracy.

B: EXPERIMENT DESIGN AND OUTCOME (2AC)

B1. How can you show that participants truly were encouraged to use different cognitive sets? And do the authors believe that participants were able to understand the possible tradeoffs? [2AC]

If the phrase "cognitive set" was confusing (we took it from Fitts (1966), who used the phrase in the same way we mean), we can change it to "speed or error bias." The actual amount of bias toward speed or accuracy is shown in Figs. 8-9, where one can see that across conditions from extremely accurate (EA) to extremely fast (EF), participants became faster and less accurate. The same patterns hold for pointing tasks with Fitts' law, whose speed-accuracy correction is in Crossman (1957) and Welford (1968). (See also Zhai IJHCS 2004.) Our work is like that of Crossman (1957), but for text entry, not pointing.

We ensured all participants understood EA-EF before the experiment, having them demonstrate their typing strategy in each condition. We agree our payoff scheme seemed complicated in the paper, but to participants, it was actually straightforward: from EA to EF, they got a smaller bonus for each phrase, but they also got a smaller error penalty, so they could enter more phrases quickly and worry less about errors. Again, Figs. 8-9 confirm that participants increased speed and decreased accuracy across EA-EF conditions, which was all that was required for our study’s manipulation, so that we could show the stability of throughput.

C: RELEVANCE AND AUDIENCE [R1]

R1 was concerned that the audience for this work would be narrow. But the audience isn't only "people interested in information-theoretic models of text entry," but anyone evaluating a text entry method. Our metric would be similarly useful as WPM or error rates. Thus, the audience is any text entry researcher or evaluator. There were 209 papers reporting text entry performance outcomes in CHI ‘16-‘18. Any of them might benefit from an additional metric of input efficiency, esp. if they had a SAT present in their results.

D: INFORMATION THEORY (2AC)

We agree with the 2AC that Shannon's model is not illuminating for human mental or physical processes. But in our work, we are not using Shannon's model in that way. We are using Shannon's model for what it does—characterize the information communicated through a noisy channel. In our case, the channel is a text entry method, and we are characterizing the performance of that channel. This is a *good* use of Shannon’s model, unlike using it to elucidate cognitive phenomena, which we emphatically are *not* doing. We will revise the paper wherever it might have given the impression of advocating a "dark age" approach. We will note the decline of Shannon's IT in psychology (even Shannon's own comment about the "bandwagon" effect of overusing IT). But text input is literally a communication channel of the kind Shannon’s theory is made for.

MISC:
We thank the reviewers and esp. the 2AC for related work, and will include these in our revision.

Finally, throughput does not substitute for speed and accuracy. As we say in the Conclusion: "At the same time, speed and accuracy should also be reported..."

We sincerely hope this rebuttal helps reviewers appreciate the purpose and value of our work. We are confident it would make a strong contribution to the text entry community. Thank you!

And here are the reviewers' responses after reading the rebuttal:

1AC:
Recommendation
Possibly Accept: I would argue for accepting this paper; 4.0
Rebuttal response
The rebuttal has been received well by some reviewers but remains problematic for R3 who still does not believe this paper represents a positive contribution to the HCI literature. Despite this, other reviewers all feel strongly positive about the paper and believe it is an interesting piece of work.

After extended discussion it was decided that this paper should be accepted. However authors are encouraged to consider the points of R3 (and other reviewers) when preparing their camera ready version. Notably:

a) The paper and rebuttal state that this is a method for assessing solely the interface as a channel, however the user remains an important and variable part of this channel. How do the authors address this?
b) The paper does not do a good job of highlighting that this is an additional metric and appears for large portions of the paper to imply that it should be used to replace SA measures.

This paper was debated in discussions and we therefore look forward to this work being discussed further in public in the future!

2AC:
Recommendation
. . . Between reject and possibly reject; 1.5
Rebuttal response

I greatly appreciate the positive tone of the rebuttal even though it did not appear to me to successfully counter any of the criticisms posed by the 2AC review.

The rebuttal states that “it'd be great if every participant in text entry studies performed with constant accuracy, but this is not feasible regardless of how text entry studies are run.” This is patently untrue. There is no basis for this statement. This is an unhelpful statement to put forward in an HCI research community. Furthermore, if a text entry study *cannot* be run at 97% accuracy, it is evidence that the text entry technique itself is fundamentally flawed.

There is no confusion about the use of the term “cognitive set”. It is a fine term. All of the criticisms in the 2AC review still apply with the rephrasing of the term as "speed or error bias." The rebuttal restates that participants demonstrated an understanding of the payoffs and does not address the criticism that, if this were true, in the “EF” condition participants should have issued random keystrokes but they did not. It is impossible that participants truly understood the payoff and sought to maximize it. It is concerning to see the rebuttal suggest that this was not required.

Regarding the relevance of the work, it would be a negative contribution to the field, and make text entry studies meaningless and unusable, if such studies were to somehow take the advice offered in this paper to report “efficiency” rather than speed and accuracy. Yes, it is true, as stated in the rebuttal, that the very last sentence of the paper undermines the goal of the paper stated in the abstract: “This work allows researchers to characterize text entry performance with a single unified measure of input efficiency.” Yes, in the last sentence of the paper, it is stated that speed and accuracy should also be reported. If this next sentence here cannot appear in the abstract, the paper should be rejected: “Although speed and accuracy should also be reported, as they provide essential practical information, this work allows researchers to characterize text entry performance with a single unified measure of input efficiency.”

I appreciate the rebuttal pointing out that “We are using Shannon's model for what it does—characterize the information communicated through a noisy channel. In our case, the channel is a text entry method, and we are characterizing the performance of that channel.” But this is not at all what is being characterized. The “channel” is everything that occurs between the presentation of the stimulus and a keystroke being entered, and there is a human in that “channel”. A text entry technique does not exist in a vacuum. It is only an input method if there is a human to activate it. And so the “noisy channel” characterization includes the human, and all of the criticisms made in the 2AC review of the paper’s use of information theory to characterize human performance stand.

Reviewer 5 (2AC)
Expertise
Knowledgeable
Recommendation
Possibly Accept: I would argue for accepting this paper; 4.0
Review
This paper addresses an important question: how do we best evaluate different text input methods such that our evaluations are not biased by the speed-accuracy tradeoff?

I found this paper really exciting to read. The paper proposes an evaluation metric by elegantly adapting the work of Shannon in information theory. The paper proceeds to explain step-by-step how to apply the new metric. Finally, it demonstrates that the metric can perform well by differentiating between two text input methods that we can, from experience, easily differentiate ourselves too: a laptop keyboard and a phone keyboard with virtual keys.

I will keep this review short - the other reviewers provided excellent comments/suggestions, and the rebuttal responds well to these. Based on the paper, the reviewer comments, and the rebuttal, I would recommend accepting this paper.

Reviewer 1
Recommendation
. . . Between neutral and possibly accept; 3.5
Rebuttal response
(blank)

Reviewer 2
Recommendation
Possibly Accept: I would argue for accepting this paper; 4.0
Rebuttal response
Post-rebuttal response:
I want to thank the reviewers for their carefully crafted rebuttal. It is now clear to me what the intended contribution of this paper is and I believe that it has been met. I have therefore increased my score from a 2.0 to a 4.0.

Before I thought that information was lost because the metric only allows distinguishing "average" (or balanced) conditions from extreme conditions. However, I now understand that this is by design, that the authors want to introduce this metric for that specific purpose and that this has added value.

Let me add that I really want to encourage the authors to dedicate sufficient space in their final paper to make the argument about text bias. Some more examples or detailed elaboration / context on the downside of working with such biases in text entry tasks can really help a wider audience to appreciate the contribution that you make in this paper. For example, although you refer to Table 1 as motivating the problem, one interpretation of Table 1 is "so, what?". Yes, the table shows that different methods lead to different conclusions (and yes, this can be important), but a little more narrative is needed here I believe. You need to explain what the consequences are of such decisions. Now it is just a mathematical example without any context.

Anyway, the work is solid, but take our (or at least my) feedback from a positive perspective (I'm trying to help :-)) and use it to articulate your argument even better in the paper. I look forward to (hopefully) reading the final version of this paper soon!

That wraps it up for today! The paper-writing part of this series has now truly come to a close, at least for a while! I hope all of you can write rebuttals that shake the heavens and move the gods, and make the reviewers see the true value of your papers!

Reference: [1] Mingrui Zhang, Shumin Zhai, and Jacob O. Wobbrock. Text Entry Throughput: Towards Unifying Speed and Accuracy in a Single Performance Metric. 2019.
