Title: Studying Text Revision in Scientific Writing
Date/Time: December 16th, 2024, 4 PM -- 6 PM EST (1 PM -- 3 PM PST)
Location: Coda C1115 Druid Hills
Zoom: https://gatech.zoom.us/j/9861666067?pwd=MkpxYWRjUUdJeWxDUHBzUmF5RVI5Zz09&omn=92372876609 (Meeting ID: 986 166 6067 Passcode: 964018)
Chao Jiang (Homepage)
Ph.D. Candidate in Computer Science
School of Interactive Computing
Georgia Institute of Technology
Committee:
Dr. Wei Xu (advisor), School of Interactive Computing, Georgia Tech
Dr. Alan Ritter, School of Interactive Computing, Georgia Tech
Dr. Kartik Goyal, School of Interactive Computing, Georgia Tech
Dr. Nanyun Peng, Computer Science Department, UCLA
Dr. Cheng Li, Google DeepMind
Abstract:
Writing is essential for sharing scientific discoveries, and researchers devote significant effort to revising their papers to improve writing quality and incorporate new findings. The revision process encodes valuable knowledge, including logical and structural improvements at the document level and stylistic and grammatical refinements at the sentence and word levels. This dissertation presents a complete computational framework for extracting text revisions at different granularities and analyzing the edits made for different purposes.
Extracting human-made revisions requires accurately matching text snippets before and after editing. In this talk, I will first present our state-of-the-art method for monolingual sentence alignment: a neural CRF model that captures sequential dependencies and semantic similarity between sentences in parallel documents. The proposed approach outperforms previous methods by a large margin and enables the creation of high-quality text revision datasets. Next, to study fine-grained editing operations within sentences, we design a novel neural semi-Markov CRF alignment model for monolingual word and phrase alignment. This model unifies word and phrase alignment using variable-length spans and achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. It also proves useful in downstream tasks, such as automatic text simplification and sentence-pair classification.
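To give a flavor of CRF-style sentence alignment, here is a minimal illustrative sketch, not the dissertation's actual model: it replaces the neural similarity scorer with toy lexical overlap, and uses Viterbi-style dynamic programming where each target sentence aligns to one source sentence or to NULL, with a transition bonus that rewards monotone order.

```python
def similarity(a: str, b: str) -> float:
    # Toy lexical stand-in (Jaccard overlap) for the neural
    # semantic-similarity scorer used in the actual model.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def align(src: list[str], tgt: list[str],
          null_score: float = 0.3, mono_bonus: float = 0.1):
    """Viterbi decoding over alignment states: state j for target
    sentence i means "tgt[i] aligns to src[j]"; state None means
    tgt[i] has no source counterpart (newly added content).
    Returns one state per target sentence."""
    states = list(range(len(src))) + [None]
    # best[s] = (cumulative score, alignment path ending in state s)
    best = {s: (0.0, []) for s in states}
    for t in tgt:
        new = {}
        for s in states:
            emit = null_score if s is None else similarity(t, src[s])
            def trans(p, s=s):
                # Reward transitions that keep source order monotone.
                if p is None or s is None:
                    return 0.0
                return mono_bonus if s >= p else -mono_bonus
            prev = max(states, key=lambda p: best[p][0] + trans(p))
            new[s] = (best[prev][0] + trans(prev) + emit,
                      best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]
```

For example, a target document that keeps two source sentences (one lightly edited) and inserts a brand-new one decodes to an alignment with a None for the inserted sentence.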
We further present arXivEdits, a dataset containing human-annotated sentence alignments and fine-grained span-level edits across multiple versions of 751 research papers. Enabled by this corpus, we perform a detailed analysis of revision strategies in scientific writing, revealing common practices researchers use to improve their papers. Finally, this dissertation explores human revision from a readability perspective through MedReadMe, a new dataset consisting of sentence-level readability ratings and complex-span annotations for 4,520 medical sentences. This dataset supports fine-grained readability analysis and the evaluation of state-of-the-art readability metrics. By incorporating novel features, we significantly improve their correlation with human judgments.
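Evaluating a readability metric against human judgments typically means computing its rank correlation with the human ratings. The sketch below is purely illustrative, assuming nothing about the dissertation's actual feature set: a toy surface-feature score (word and sentence length) is compared to hypothetical 1-7 difficulty ratings via Spearman correlation (no tie handling).

```python
from statistics import mean

def surface_score(sentence: str) -> float:
    # Toy readability score: longer words and longer sentences read
    # as harder. Classic metrics (e.g., Flesch-Kincaid) additionally
    # count syllables; the weights here are arbitrary.
    words = sentence.split()
    return 0.5 * mean(len(w) for w in words) + 0.1 * len(words)

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation between metric scores and human
    ratings (assumes no ties, so plain ranks suffice)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A metric whose scores rank sentences in the same order as the human ratings attains a correlation of 1.0; richer features aim to push real metrics closer to that ceiling.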