This data set captures significant changes in disclosure texts in the form of low similarity scores. Intuitively, when a firm breaks from routine phrasing and content in its mandatory disclosures, the change offers clues about its future performance, which in turn drives stock returns over time.
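To make the idea of a similarity score concrete, here is a minimal sketch that compares two disclosure texts using cosine similarity over word counts. The measure actually used by the data set is not stated, so the choice of cosine similarity, the function name `cosine_similarity`, and the sample sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts (0.0 to 1.0)."""
    counts_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical excerpts from the same section in two consecutive filings.
last_year = "We face risks from competition and changes in interest rates."
this_year = "We face risks from litigation, a pending investigation, and supply shortages."
score = cosine_similarity(last_year, this_year)
```

Under this sketch, a score near 1 means the wording barely changed from one filing to the next, while a low score marks the kind of break from routine phrasing described above.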
The data set offers 56 fields that measure linguistic features for each section (Risk, MD&A, Legal, etc.) of a 10-K filing. These linguistic features include readability, textual complexity, language voice (active and passive), vocabulary variation, emphasis, and disclosure tone.
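As a rough illustration of what such fields might measure, the sketch below computes two simple features for a section text: average sentence length (a crude complexity proxy) and type-token ratio (a crude vocabulary-variation proxy). These are not the data set's actual 56 fields; the function name `simple_text_features` and both feature definitions are assumptions.

```python
import re


def simple_text_features(text: str) -> dict:
    """Return two illustrative linguistic features for a block of text."""
    # Naive sentence split on terminal punctuation; abbreviations and
    # decimals will be miscounted, which is acceptable for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {"avg_sentence_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Words per sentence: longer sentences suggest more complex text.
        "avg_sentence_length": len(words) / len(sentences),
        # Distinct words over total words: higher means more varied vocabulary.
        "type_token_ratio": len(set(words)) / len(words),
    }


features = simple_text_features("The firm grew. The firm shrank.")
```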
The data set records the date on which a firm files a Non-Timely notification with the SEC.
Repeated inability to file annual or quarterly disclosures on time flags ongoing governance issues and calls the firm's general competence into question.
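One way to turn these notification dates into a signal is to count notifications per firm and flag repeat late filers. The sketch below assumes a hypothetical record layout of (firm identifier, notification date) pairs and an arbitrary threshold of two notifications; neither comes from the data set itself.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (firm identifier, date the notification was filed).
nt_filings = [
    ("FIRM_A", date(2021, 3, 16)),
    ("FIRM_B", date(2021, 8, 10)),
    ("FIRM_A", date(2021, 11, 15)),
    ("FIRM_A", date(2022, 3, 17)),
]


def repeat_late_filers(filings, min_count=2):
    """Return {firm: notification count} for firms with at least min_count notifications."""
    counts = Counter(firm for firm, _ in filings)
    return {firm: n for firm, n in counts.items() if n >= min_count}


flagged = repeat_late_filers(nt_filings)
```

With the sample records, only `FIRM_A` is flagged, since it filed three notifications while `FIRM_B` filed one.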
For every 10-K filed by a firm reporting to the SEC, the text is cleaned and parsed into its sections (Risk, Legal, MD&A, etc.). These section texts are stored and can be retrieved in real time once the filing has been made.
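The parsing step can be sketched as splitting the cleaned text on 10-K item headings. The example below assumes the text is already plain text with each heading on its own line, and it maps Item 1A to Risk, Item 3 to Legal, and Item 7 to MD&A. The function name `split_sections`, the regular expression, and the sample text are illustrative assumptions, not the data set's actual pipeline.

```python
import re

# Assumed mapping from 10-K item numbers to section names.
SECTION_ITEMS = {"1A": "Risk", "3": "Legal", "7": "MD&A"}

# Matches headings such as "Item 1A." or "ITEM 7:" at the start of a line.
ITEM_HEADING = re.compile(r"^\s*Item\s+(\d+[A-Z]?)\s*[.:]", re.IGNORECASE | re.MULTILINE)


def split_sections(filing_text: str) -> dict:
    """Return {section name: text} for the mapped items found in filing_text."""
    matches = list(ITEM_HEADING.finditer(filing_text))
    sections = {}
    for i, m in enumerate(matches):
        item = m.group(1).upper()
        # A section runs from its heading to the next item heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        if item in SECTION_ITEMS:
            sections[SECTION_ITEMS[item]] = filing_text[m.end():end].strip()
    return sections


sample = """Item 1A. Risk Factors
Competition may reduce our margins.
Item 3. Legal Proceedings
We are party to routine litigation.
Item 7. Management's Discussion and Analysis
Revenue increased during the year.
Item 7A. Quantitative and Qualitative Disclosures About Market Risk
Not applicable.
"""
sections = split_sections(sample)
```

Real filings are messier than this sample: headings also appear in the table of contents and formatting varies by filer, so a production parser would need more robust heading detection.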