The data set provides 56 fields that measure linguistic features for each section (Risk, MD&A, Legal, etc.) of a 10-K filing. These features include readability, textual complexity, voice (active and passive), vocabulary variation, emphasis, and disclosure tone.
Used alongside other data sources, for example as inputs when training a model, these linguistic features can help detect deception and obfuscation by the disclosing firm.
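To make the idea of a linguistic feature concrete, the sketch below computes two simple measures from a passage of text: average sentence length (a rough proxy for textual complexity) and type-token ratio (a rough proxy for vocabulary variation). The function name `linguistic_features`, the tokenization rules, and the sample text are illustrative assumptions; the definitions behind the data set's 56 fields are not given here and may differ.

```python
import re

def linguistic_features(text: str) -> dict:
    """Compute two illustrative features; NOT the data set's own definitions."""
    # Crude sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Crude word tokenization: lowercase runs of letters and apostrophes
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {"avg_sentence_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Words per sentence: longer sentences suggest more complex text
        "avg_sentence_length": len(words) / len(sentences),
        # Distinct words / total words: higher means more varied vocabulary
        "type_token_ratio": len(set(words)) / len(words),
    }

# Hypothetical sample text, not taken from any filing
sample = "The company may face risks. The risks may be material."
features = linguistic_features(sample)
print(features)
```

Measures like these would typically be computed per filing section and then joined with other firm-level data before any modeling.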
The data set covers all listed U.S. entities regulated by the U.S. Securities and Exchange Commission (SEC).