based on "How to review for ARR" by Anna Rogers, Isabelle Augenstein
modified by Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Before starting the review, please read the following:
- ACL Code of Ethics
- ACL policy on anonymization, review and citation
- ACL call for papers, which describes the scope of the invited submissions and instructions for the authors, and what sections are compulsory
- Responsible NLP Checklist
- If you are new to reviewing, we have a tutorial on the basics of the review process
- This tutorial: policies specific for ACL’23
Please have a look at the paper as soon as you get the assignment. If you notice any serious issues with the paper (format violations, missing compulsory sections, deanonymization through self-citations, acknowledgements or non-anonymous repositories), or you feel that the paper is too far out of your area of expertise, reach out to your senior area chair so that they can adjust the assignments or consider desk rejection. If you only look at the paper close to the deadline, it will be too late (and you’ll have to review it anyway, so checking early can save you work!).
The submission has to contain the actual paper, which may be accompanied by appendices, code and/or data. As a general rule, the paper should be readable as a standalone document, and any details important for understanding the key aspects of the work should be in the paper rather than in appendices, accompanying source code, etc. Non-key aspects (e.g. implementation details, parameters, examples or experiments that provide extra support for the main claims) can be provided in the supplementary materials, and you can consider them if you have questions. Each paper is also accompanied by the responsible NLP checklist (visible in the submission page). It can be viewed as a kind of FAQ for the paper (Are hyperparameters reported? Was the study IRB approved? Were the participants fairly compensated? etc.) If after reading the paper you have a question that falls in the scope of the checklist, please see if it is already answered to your satisfaction before raising it in your review.
In this tutorial:
- ACL 2023 review form
- Responsible NLP checklist
- Ethics review
- ACL 2023 peer review policies: writing a strong review
- The types of issues that the authors can flag reviews for
Note: The latter two sections were added to this post after the start of ACL’23 author response for completeness and in response to community feedback. We hope having all materials in one place would also prove useful to future chairs.
1. ACL 2023 review form
The full form can be viewed here.
Your review will be seen by three different audiences: (a) the authors, (b) the other reviewers and the area chair, and (c) the senior area chairs and the program chairs (this is discussed in more detail below, “writing a strong review”). Not all parts of the form are addressed to all of them.
What is this paper about?
The summary of a paper should be just that: a neutral, dispassionate summary of the research question and findings/contributions. We would help the program chairs the most by not tainting these summaries with, e.g., suggestions that the topic is exciting or a waste of time.
Make sure you acknowledge all the contributions that you believe the paper is making: experimental evidence, replication, framing of a new question, artifacts that can be used in future work (models, resources, code), literature review, establishing new cross-disciplinary connections, conceptual developments, theoretical arguments. A paper may make several contributions, and not all of them need to be equally strong. You should state in your own words what you see as contributions of the paper, rather than copy/paste it from the abstract. This helps the authors and the chairs to see that you understood the gist of the paper, and hence the rest of your review should be taken seriously.
Reasons to accept
Even if you fundamentally disagree with something in the paper, it is important for the program chairs (and the mental health of the authors) that you accurately state all the best aspects of it. It also helps them to see that you actually understand the work, and hence any criticisms should be taken seriously.
The contributions of the paper are the obvious candidates for the “reasons to accept”. They may come in many different forms and should be recognized, even if they are not the type of contribution that you aim for in your own work. ACL CFP explicitly welcomes the following kinds of contributions: an engineering solution, a literature review, a useful artifact (a model, software, a resource), reproduction of prior results, analysis (of models or data), theory, argumentation/position.
In assessing a contribution, performance improvements or complex math are by themselves neither necessary nor sufficient. It should be clear in what way the study advances the field: what did we learn from it that we did not know before? What can we do that we could not do before? Are the research claims clearly articulated and sufficiently backed up by the appropriate evidence, literature or theory/arguments?
If one of the strengths is the performance of the proposed system, do check that the baseline is sufficiently strong and well-tuned, and that the result is robust (e.g., not just the best of an unknown number of runs with unknown standard deviation between runs). Ideally, the computation budget should be reported. If you cannot find something in the paper, see if the responsible NLP checklist contains any answers or pointers to the paper or supplementary materials.
Apart from the contributions of the paper, its strengths may be in e.g. being clear and well-written, in formulating a timely/important question, in proposing a better framing of a problem, showing previously ignored connections between ideas/fields, a conceptual development, etc.
Reasons to reject
Many things could go wrong in a study. Although reviewers often disagree about acceptance recommendations for papers above a certain quality threshold, the NeurIPS experiment showed that low-quality papers are identified quite reliably. There may be claims that are not actually supported by the evidence or by the arguments, but that are presented as conclusions rather than as hypotheses/discussion. The framing may be misleading. There may be obvious methodological flaws (e.g., only the best run results are reported), errors in the proofs, in the implementation, or in the analysis. There may be insufficient detail to understand what was done or how to reproduce the method and the results. There may be a lack of clarity about what the research question is (even if it is “Does system A work better than system B”?), what was done, why, and what was the conclusion. The paper should also make it clear in what way the findings and/or the released artifacts advance the field.
At the same time, there are many complaints about reviews at previous conferences that cite as weaknesses things that are not in fact weaknesses, but rather belong in the “suggestions” box. Once you have the first draft of your review, you will need to revisit it to check for these frequent issues.
Every work has limitations, and ACL 2023 submissions include a mandatory section for discussing that. Please take care to not penalize the authors for seriously thinking through the limitations of their work, in ethical or any other aspect. First, this information is crucial for future work. Second, if we as a community reward focusing only on positive aspects of research, this contributes to the over-hyping problem which damages the credibility of the whole field.
Questions for the Author(s)
Do you have major questions for the authors - especially if their answers would be relevant for your recommendation? Please note that the authors will have limited space to answer, and this section should be reserved for the “big” questions - not things like typos or non-critical missing references.
Please use the letter numbering scheme for your questions, e.g.
Combined with your reviewer number, this will make it easy to refer to your questions in rebuttal and reviewer discussion (e.g. question 2A = question A of reviewer 2).
Please list any references that should be included in the bibliography or need to be discussed in more depth. If you mentioned missing previous work or lack of baselines, give the full citation. In particular, if you say that the submission lacks novelty, please be sure to include the specific work that you believe to be similar.
Typos, Grammar, Style, and Presentation Improvements
While the other fields are written for the chairs, this field is exclusively a message to the authors. This is the place for presentation suggestions, clarifications, pointing out (a reasonable amount of) typos or language errors: the small things that you do not consider “weaknesses” and should not distract the chairs from your overall assessment. Of course, if you find the paper completely illegible or unclear, such issues can escalate to the “weakness” list for the chairs.
Things could be unclear because it is generally not easy to communicate clearly and effectively with a wide audience—especially when you have been working on a project for a long time and it all seems obvious to yourself, but not to others. Reviewers are the beta-readers, who can help the authors to significantly improve their papers. If you see something that they could do to increase the potential impact of their work, it is professional courtesy to let them know. Some examples include figures that are hard to read or interpret, the points that require extra background reading, ambiguity, and non-obvious connections between sections. If you have comments or questions that you would like the authors to respond to (beyond “thanks, we’ll fix this”), please move them to the “questions for the authors” and include them in the letter-numbered questions.
Be careful in suggesting citations of your own papers. When you have truly relevant related work, this is appropriate, but it is also potentially de-anonymizing (especially when the paper in question is not very well-known). One workaround is to suggest several citations, including yours.
Reproducibility, Ethics Review, Anonymity Requirement and Overall Recommendation
The most important part of the review is your thorough comments. However, what many people focus on are numbers, particularly the “overall recommendation”. Thus, it would be remiss for our advice to overlook this part of the process.
ACL 2023 is trying something new this year. We are splitting this decision into two parts: how sound the paper is and how exciting you find the paper.
Soundness goes to how well the paper clearly states its claims and backs up them with evidence and argumentation. We want all papers to clear a minimum bar for presenting solid, thorough, technically sound research. We suspect that this will have lower variance than the other key metric, excitement. Please work with your fellow reviewers to come to a consensus: what are the claims of the paper and do you believe them?
Excitement captures the more subjective evaluations of the novelty/significance of the contributions, and their potential interesting-ness to the community. Because not everyone will calibrate their scores the same way (this will depend on the batch of papers you get, your research area, and the direction you think the field should go), make sure that your scores are justified in the text of the review.
The AC should be able to understand your justification of both scores to create a consensus score from all of the reviews. Please try to disentangle these two aspects in your review: e.g. even if you find a paper boring, it could still be sound and thorough.
As you assess the reproducibility of the paper, think about not just whether the raw information to reproduce the paper are included. Also think about how easy it would be to reproduce the paper: imagine you had to tell someone who had a strong CS or Linguistics background but was not immersed in the field how to reproduce the paper in an e-mail. Could you just hand them the submission? If not, how long would the e-mail be to fill in the gaps?
If your confidence is low, please use the confidential comments section at the end of the form to let your meta-reviewer know which aspects of your review might not be reliable.
The review form also allows you to flag the paper for potential breaches of ACL’s anonymity period, or indicate that you know the authors’ identity from some source.
Best Paper Award
Reviewers are often a little too conservative in their nominations. After all, you just written down all of the flaws in the paper. And you may think about all of the best papers you’ve seen… but only after the review process and incorporating all of the great suggestions reviewers like you can make. Rather than thinking about the question as “is this a best paper” as is, consider the question more as “how could you argue that this will be best paper”. It is okay if this paper is unique in topic or technique, that could be what makes it a best paper!
2. Responsible NLP checklist
This checklist serves a triple purpose. The reviewers can treat it as a kind of FAQ: if after reading the paper they have questions about reproducibility, approval, compensation to the study participants, licenses etc. - they can see if these questions have already been answered to their satisfaction, before writing a negative review. To the authors, it gives the space to preemptively answer such questions and to make sure that they weren’t forgotten during writing. Finally, since the ACL2023 checklist answers will be part of publication of accepted papers, they generally help to improve the transparency of NLP research.
For your convenience, here is the full list of questions that all authors answered as part of their submission.
A. For every submission:
- A1. Did you describe the limitations of your work?
- A2. Did you discuss any potential risks of your work?
- A3. Do the abstract and introduction summarize the paper’s main claims?
- A4. Did you use AI writing assistants when working on this paper?
B. Did you use or create scientific artifacts?
- B1. Did you cite the creators of artifacts you used?
- B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
- B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
- B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
- B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
- B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created?
C. Did you run computational experiments?
- C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
- C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
- C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
- C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
D. Did you use human annotators (e.g., crowdworkers) or research with human participants?
- D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
- D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants’ demographic (e.g., country of residence)?
- D3. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?
- D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
- D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
3. Ethics review
The general program committee members are not expected to be ethics experts. If you see something that you think could be problematic, you can flag the paper to be reviewed by a dedicated ethics review team. Generally, reviewers should look for issues with use of data sets and how they were collected, potential abuse, and disadvantages for minority groups. Please consider the relevant Ethics review questions and the Ethics FAQ, as needed. Many commonly asked questions relevant to the ethics review are a part of the Responsible NLP checklist, and so the authors have already answered them for you as part of their submission.
How do you tell if the authors’ answers are satisfactory? There is no single number for what counts as “fair pay”, nor a simple rule that would account for “legal grounds for data collection/processing”. The laws and practices at the authors’ home country and institution probably differ from yours, and it would be unfair to expect that the authors would follow exactly your own practice. So fundamentally the authors make their case for what they did, and try to convince you that this was appropriate – similarly to all other aspects of paper review. That being said, the broad principles of the ACM Code of Ethics need to be upheld, even for the authors from the countries that did not ratify the human rights convention. If you believe there may be a serious ethical issue, but you lack the expertise to evaluate it, you may flag the paper for a separate review process by the ACL ethics committee. They will make a recommendation as to whether the paper should be accepted, revised, or rejected on ethical grounds.
4. ACL 2023 peer review policies: writing a strong review
1. Be specific
If you would like to flag any issues, it should be specific and ideally understandable to the chairs without reading the full paper. Let us consider a few examples.
|❎ Instead of:
|The paper is missing relevant references
|The paper is missing references XYZ
|X is not clear
|Y and Z are missing from the description of X.
|The formulation of X is wrong
|The formulation of X misses the factor Y
|The contribution is not novel
|Highly similar work X and Y has been published 3+ months prior to submission deadline
|X was done in the way Y
|X was done in the way Y which has the disadvantage Z
|The algorithm’s (line 723) interaction with dataset 3 (line 512) is problematic
|It’s possible that using the decoding from Smith and Eisner (2006) on the Hausa newswire dataset, there might not be enough training data to rely on the n-best list.
The last example is fairly specific, but it doesn’t require extensive cross-referencing with the submission. The version on the right can be understood by an area chair (who has expertise in the subarea) without finding those line numbers.
2. Check for shortcuts
Judging whether a research paper is “good” is an objectively hard task. To avoid dealing with this problem, reviewers often resort to “shortcuts”, quick heuristics that serve as reasons to dismiss papers, such as the famous “reject if not SOTA.”
Here is a list of such heuristics derived from author complaints in recent NLP conferences, and why they are not always “weaknesses.” Many of them have even been studied in psychology and bibliometric research.
|Why this is problematic
|The results are not surprising
|Many findings seem obvious in retrospect, but this does not mean that the community is already aware of them and can use them as building blocks for future work.
|The results contradict what I would expect
|You may be a victim of confirmation bias, and be unwilling to accept data contradicting your prior beliefs.
|The results are not novel
|Such broad claims need to be backed up with references.
|This has no precedent in existing literature
|Believe it or not: papers that are more novel tend to be harder to publish. Reviewers may be unnecessarily conservative.
|The results do not surpass the latest SOTA
|SOTA results are neither necessary nor sufficient for a scientific contribution. An engineering paper could also offer improvements on other dimensions (efficiency, generalizability, interpretability, fairness, etc.) If the authors do not claim that their contribution is the SOTA status, the lack thereof is not an issue.
|The results are negative
|The bias towards publishing only positive results is a known problem in many fields, and contributes to hype and overclaiming. If something systematically does not work where it could be expected to, the community does need to know about it.
|This method is too simple
|The goal is to solve the problem, not to solve it in a complex way. Simpler solutions are in fact preferable, as they are less brittle and easier to deploy in real-world settings.
|The paper doesn't use [my preferred methodology], e.g., deep learning
|NLP is an interdisciplinary field, relying on many kinds of contributions: models, resource, survey, data/linguistic/social analysis, position, and theory.
|The topic is too niche
|A main track paper may well make a big contribution to a narrow subfield.
|The approach is tested only on [not English], so unclear if it will generalize to other languages
|The same is true of NLP research that tests only on English. Monolingual work on any language is important both practically (methods and resources for that language) and theoretically (potentially contributing to deeper understanding of language in general).
|The paper has language errors
|As long as the writing is clear enough, better scientific content should be more valuable than better journalistic skills.
|The paper is missing the [reference X]
|Per ACL policy, missing references to prior highly relevant work is a problem if such work was published 3+ months before the submission deadline (December 2022). Otherwise missing references belong in the "suggestions" section, especially if they were only preprinted and not published.
|The authors could also do [extra experiment X]
|It is always possible to come up with extra experiments and follow-up work. But a paper only needs to present sufficient evidence for the claim that the authors are making. Any other extra experiments are in the “nice-to-have” category, and belong in the “suggestions” section rather than “reasons to reject.” This heuristic is particularly damaging for short papers.
|The authors should have done [X] instead
|A.k.a. “I would have written this paper differently.” There are often several valid approaches to a given problem. This criticism applies only if the authors’ choices prevent them from answering their research question, their framing is misleading, or the question is not worth asking. If not, then [X] is a comment or a suggestion, but not a “weakness.”
If you have something like the above listed as a “weakness” of the paper, do try to revisit the paper with an open mind. Both the authors and the area chairs will be aware of these guidelines. You may be asked to revisit your review.
3. Check that your scores reflect the review
Look at your review and the numeric scores and check that your list of weaknesses actually justifies that score. If you give a low Soundness score without finding any major faults, this means that your review is not a faithful explanation of your recommendation. One reason for that could be that you secretly rely on some unconscious bias or heuristic for your evaluation, like the ones listed in the previous section.
Another reason why reviewers sometimes give “meh” scores without indicating serious problems is that they just do not find the topic interesting/exciting, even if the work is sound. At ACL 2023, this should not be a problem because you are asked to separately evaluate the soundness and the exciting-ness of the submission. If this paper is not your favorite kind of research - you can indicate that in the exciting-ness question, and let the soundness evaluation be untainted by that.
4. Check that the tone is professional and neutral
Above all, a good review should be professional and neutral, if not kind. Try **not **to think of it as a task of eliminating as many papers as possible, but as a task of identifying good research and helping authors improve their papers. As an anonymous reviewer, you are in the position of power, and the written medium makes it easier to be more dismissive and sarcastic than if you were communicating with the authors face-to-face.
You may well be addressing someone who is only starting on the research path. And/or someone struggling with stress and other mental health issues. And/or someone who has been less lucky with their school and advisors. Academic lives are already stressful enough, and we do not need to inflict extra damage to the mental health of our colleagues with sarcastic or insulting reviews. Write the review you would like to get yourself.
The fact is, authors and reviewers _are _actually the same people: the conference crowd lining up for coffee, happily chatting with each other about the grand problems of AI even if they do not always agree. Why can’t we do peer review in the same spirit of fun intellectual interaction with colleagues?
Just like the conference coffee break chats, reviewing is in fact mutually beneficial: in exchange for feedback, the reviewers get to have the first look at cutting-edge research. Just like the coffee breaks, where you do not necessarily meet only the people working in your subsubfield, peer review may expose you to new techniques. Since there is careful written communication in the authors’ response, you may even find that you were wrong about something, which is a much faster way to learn than making the same mistake in your own research.
Not to mention, someone out there is reviewing your papers too. The more rude reviews there are, the more of a norm they become, and the higher the chances you will get one yourself in the future.
5. Update your review after the author response and discussion!
As an author, it is rather depressing to spend a week carefully preparing responses to the reviews, only to see in the notification that the reviews did not change at all - and be left to wonder whether you failed to convince the reviewers or they simply did not even read the rebuttal. Importantly, this is also a waste of the reviewers’ effort: if the authors think that the response was simply ignored, they lose a valuable source of information (that their argumentation was not convincing to you).
Please check what you listed as major weaknesses, and see how the authors address your concerns. The form will contain a way to indicate whether your opinion changed or maybe even deteriorated. Even if your evaluation remains the same, this step makes it much more informative to the authors (with little effort on your part). The scores and the text of your original review will also be editable, and you can add a line to the review to say what you found convincing or not.
During discussion, when you see the reviews by the other reviewers, be aware that they may affect your own judgment in several ways because of social psychology effects. If you pointed out a weakness and then saw it also pointed out by someone else, you may feel it is a more severe problem than you originally thought. You may “discard” your own opinion if you see the opinion of someone you believe to be more senior/experienced/famous. If your original opinion was very positive, in the face of a strongly negative review you may be tempted to “converge to the mean” (or the other way round). Ask yourself if this is really warranted, and whether this may be reflecting some core dispute between approaches/methodologies etc. Strong disagreements may be an indication that the paper is doing something unusual and interesting.
Remember that the reviewer will not see the discussion on the submission site. So if you dramatically change your score because of the discussion, make sure explain why that happened in the text of the review.
5. The types of issues that the authors can flag reviews for
In rare cases reviews may be of unacceptably low quality, which violates the conference peer review policy. The authors can then use the “response to chairs” box in the author response form to report the type of the issue and explain your rationale to the chairs. This mechanism should only be used for serious issues. It is not in the authors’ interest to make their meta-reviewers investigate cases where the authors disagree with the reviewers, but the reviewers have done due diligence and provide their arguments/evidence/references.
The following types of issues are known from past conferences:
B. The review exhibits one of the heuristics discussed in the ACL23 review policy blog post, such as "not novel", "not surprising", "too simple", "not SOTA". Note that these criticisms may be legitimate, if the reviewer explains their reasoning, and backs up the criticism with arguments/evidence/references. Please flag only the cases where you believe that the reviewer has not done due diligence.
C. The scores do not match the review text. Note that in ACL23, the "soundness" score is meant to reflect the technical merit of the submission, and low soundness should be backed up with serious objections to the work. The "excitement" score is more subjective, and its justification may not be reflected in the text.
D. The review is rude/unprofessional
E. The review does not evince expertise (incl. texts that seem to be synthetic and not based on a deep understanding of the submission)
F. The review does not match the paper type (e.g. short paper expected to produce more experiments than is necessary to support the stated claim)
G. The review does not match the type of contribution (e.g. experimental work expected of a paper stating a different kind of contribution)
H. The review is missing or too short and uninformative
I. The review was late and could not be addressed in the author response
J. Other (please explain)
If you feel that you have such a problem, please use the following format to report it in the text box below (without the #comment lines, 250 words max). In this example, Reviewer 1 had issue A (unspecific review) and Reviewer 2 had issues C and D (rude review, scores don’t match the text).
# review problem type(s), as a capital letter corresponding to the issue type in the above list of possible issues. If there is more than one, list them comma-separated (e.g. A, I)
R1 states [reviewer statement], which we believe corresponds to the review issue type A. It is unreasonable in this case because [rationale].
R2 states [reviewer statement]...
FAQ: Can I use AI writing assistants to write my review?
If we only consider content, it could be reasonable to use writing assistance to paraphrase the review, e.g. to help the reviewers who are not native speakers of English. It goes without saying that the reviewer still has to fully read the paper and come up with the content of the review by themselves.
It is also reasonable to use tools that help to check proofs, explain concepts that the reviewer doesn’t know (as long as the explanations are correct and don’t lead the reviewer to misunderstand the submission). It has always been possible to check and run the code submitted by the authors.
That said, the content of the submissions and reviews is confidential, and some popular solutions (e.g., ChatGPT by OpenAI) rely on passing the information to through their API, where it will be retained. So – we highly discourage the use of ChatGPT and similar non-privacy-friendly solutions for peer review.