The Similarity Index quantifies the textual differences between sections of a given company's consecutive annual filings on an "as disclosed" basis. For example, similarity scores are calculated by comparing each section of a company's 2017 10-K with the corresponding section of its 2016 10-K.
Intuitively, a firm that breaks from its routine phrasing and content in mandatory disclosures is giving clues about its future performance, and those clues eventually show up in stock returns. This data set captures significant changes in disclosure text as low similarity scores.
Academic research has shown that a portfolio that shorts firms with low similarity scores and buys firms with high similarity scores earns non-trivial and uncorrelated returns over a period of 12-18 months. Full details can be found in the paper "Lazy Prices".
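As a rough illustration of the long/short idea, the sketch below ranks firms by similarity score and picks the two legs of the portfolio. The function name long_short_legs, the 20% cutoff, and the tickers and scores are all hypothetical and chosen only for illustration; they are not taken from the data set or from the paper.

```python
def long_short_legs(scores, fraction=0.2):
    """Split firms into a short leg (lowest scores) and a long leg (highest).

    scores   -- dict mapping a firm identifier to its similarity score
    fraction -- share of firms placed in each leg (assumed cutoff, not from the source)
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5] so the legs cannot overlap")
    if len(scores) < 2:
        raise ValueError("need at least two firms to form both legs")

    # Sort firm identifiers from lowest to highest similarity score.
    ranked = sorted(scores, key=scores.get)
    # Always put at least one firm in each leg.
    k = max(1, int(len(ranked) * fraction))
    return {"short": ranked[:k], "long": ranked[-k:]}


# Made-up scores for five hypothetical tickers.
example_scores = {"AAA": 0.95, "BBB": 0.40, "CCC": 0.80, "DDD": 0.10, "EEE": 0.99}
legs = long_short_legs(example_scores)
print(legs)
```

In this toy input the firm whose filing changed the most lands in the short leg and the firm whose filing changed the least lands in the long leg.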
Jaccard and Cosine similarity scores are generated for each section within an annual filing, including but not limited to Risk, Legal, MD&A, and Business Overview.
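To make the two measures concrete, here is a minimal sketch of Jaccard and cosine similarity between two pieces of text. It assumes simple lowercase word tokenization and raw term counts; the actual preprocessing and weighting used to build this data set are not described here and may differ. The helper names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(text_a, text_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    union = set_a | set_b
    if not union:
        return 0.0  # undefined for two empty texts; 0.0 is a convention of this sketch
    return len(set_a & set_b) / len(union)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined when either text is empty; convention of this sketch
    return dot / (norm_a * norm_b)


# Made-up sentences standing in for the same section in two consecutive filings.
prior_year = "We face risks from competition and regulation."
current_year = "We face risks from competition, regulation and pending litigation."
print(jaccard_similarity(prior_year, current_year))
print(cosine_similarity(prior_year, current_year))
```

Both scores fall between 0 and 1 for these inputs, with 1 meaning the two texts use identical vocabulary (Jaccard) or identical word proportions (cosine), so a heavily rewritten section produces a low score.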
The data set covers all listed US entities that are regulated by the U.S. Securities and Exchange Commission (SEC).