Title: Studying Text Revision in Scientific Writing

 

Date/Time: December 16th, 2024, 4 PM -- 6 PM EST (1 PM -- 3 PM PST)

Location: Coda C1115 Druid Hills

Zoom: https://gatech.zoom.us/j/9861666067?pwd=MkpxYWRjUUdJeWxDUHBzUmF5RVI5Zz09&omn=92372876609 (Meeting ID: 986 166 6067, Passcode: 964018)

 

Chao Jiang (Homepage)

Ph.D. Candidate in Computer Science

School of Interactive Computing

Georgia Institute of Technology

 

Committee:

Dr. Wei Xu (advisor), School of Interactive Computing, Georgia Tech

Dr. Alan Ritter, School of Interactive Computing, Georgia Tech

Dr. Kartik Goyal, School of Interactive Computing, Georgia Tech

Dr. Nanyun Peng, Computer Science Department, UCLA

Dr. Cheng Li, Google DeepMind

 

Abstract:

Writing is essential for sharing scientific discoveries, and researchers devote significant effort to revising their papers to improve writing quality and incorporate new findings. The revision process encodes valuable knowledge, including logical and structural improvements at the document level and stylistic and grammatical refinements at the sentence and word levels. This dissertation presents a complete computational framework for extracting text revisions at different granularities and analyzing the edits made for different purposes.

 

Extracting human-made revisions requires accurately matching text snippets before and after editing. In this talk, I will first present our state-of-the-art method for monolingual sentence alignment: a neural CRF model that captures both sequential dependencies and semantic similarity between sentences in parallel documents. The proposed approach outperforms previous methods by a large margin and enables the creation of high-quality text revision datasets. Next, to study fine-grained editing operations within sentences, we design a novel neural semi-Markov CRF alignment model for monolingual word and phrase alignment. This model unifies word and phrase alignments through variable-length spans and achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. It also proves useful in downstream tasks such as automatic text simplification and sentence-pair classification.
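To give a flavor of the alignment stage, the sketch below frames sentence alignment as a dynamic program over alignment decisions. It is a deliberately simplified, non-neural stand-in: a toy Jaccard similarity replaces the learned semantic scorer, and a plain Viterbi-style traceback replaces the CRF; the function names and the `null_score` parameter are illustrative assumptions, not the dissertation's actual implementation.

```python
# Illustrative sketch: monotonic sentence alignment via dynamic programming.
# NOT the neural CRF from the talk; similarity here is a toy bag-of-words
# Jaccard score standing in for a learned semantic similarity model.

def jaccard(a: str, b: str) -> float:
    """Toy similarity between two sentences (stand-in for a neural scorer)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align(src: list[str], tgt: list[str], null_score: float = 0.2):
    """Best monotonic alignment between two sentence lists.

    Each target sentence aligns to a source sentence or to NULL (newly
    added); each source sentence may also be deleted. A CRF-style model
    would additionally learn transition scores between adjacent decisions.
    Returns a list of (source_index, target_index) aligned pairs.
    """
    n, m = len(src), len(tgt)
    NEG = float("-inf")
    # dp[i][j]: best score covering first i source and j target sentences
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == NEG:
                continue
            if i < n and j < m:  # align src[i] with tgt[j]
                s = dp[i][j] + jaccard(src[i], tgt[j])
                if s > dp[i + 1][j + 1]:
                    dp[i + 1][j + 1], back[i + 1][j + 1] = s, (i, j, "align")
            if i < n:  # src[i] was deleted during revision
                s = dp[i][j] + null_score
                if s > dp[i + 1][j]:
                    dp[i + 1][j], back[i + 1][j] = s, (i, j, "del")
            if j < m:  # tgt[j] was newly added
                s = dp[i][j] + null_score
                if s > dp[i][j + 1]:
                    dp[i][j + 1], back[i][j + 1] = s, (i, j, "add")
    # Trace back the best path, collecting aligned pairs
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj, op = back[i][j]
        if op == "align":
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))
```

In the actual model, the pairwise scores come from a neural encoder and the transition structure is learned, but the decoding intuition is similar: a global search over alignment decisions rather than independent per-sentence matching.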

 

We further present arXivEdits, a dataset containing human-annotated sentence alignments and fine-grained span-level edits across multiple versions of 751 research papers. Enabled by this corpus, we perform a detailed analysis of revision strategies in scientific writing, revealing common practices researchers use to improve their papers. Finally, this dissertation examines human revision from a readability perspective through MedReadMe, a new dataset consisting of sentence-level readability ratings and complex-span annotations for 4,520 medical sentences. This dataset supports fine-grained readability analysis and the evaluation of state-of-the-art readability metrics; by incorporating novel features, we significantly improve their correlation with human judgments.
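To make the metric-evaluation setup concrete, here is a hedged sketch of how a sentence-level complexity feature can be scored against human readability ratings. The feature (long-word ratio as a crude jargon proxy), the example sentences, and the hand-rolled rank correlation are all illustrative assumptions; MedReadMe's actual features and annotations are far richer.

```python
# Illustrative sketch: correlating a toy readability feature with human
# ratings. The feature and data are hypothetical, not from MedReadMe.

def long_word_ratio(sentence: str, min_len: int = 8) -> float:
    """Fraction of words with >= min_len characters (crude complexity proxy)."""
    words = sentence.split()
    return sum(len(w) >= min_len for w in words) / len(words) if words else 0.0

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, for illustration only)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # ranks 0..n-1 have this mean
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var if var else 0.0

# Hypothetical sentences with hypothetical 1-5 difficulty ratings
sentences = [
    "The patient has a headache.",
    "Symptoms improved after a short rest.",
    "Idiopathic intracranial hypertension was diagnosed.",
]
ratings = [1, 2, 5]
feature = [long_word_ratio(s) for s in sentences]
correlation = spearman(feature, ratings)
```

Evaluating an existing metric (or an augmented one with extra features) against the human ratings reduces to exactly this kind of correlation computation, typically with a tie-aware Spearman or Kendall implementation in practice.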