Data Sets

Similarity Index

This data set captures significant changes in disclosure text, expressed as low similarity scores. Intuitively, when a firm breaks from routine phrasing and content in its mandatory disclosures, the change offers clues about its future performance, which in turn drives stock returns over time.
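As a rough illustration of how a low similarity score can signal a break from routine phrasing, the sketch below computes a bag-of-words cosine similarity between two disclosure texts. This is not the data set's actual methodology, which is not described here. The choice of cosine similarity, the tokenizer, and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Return the cosine similarity of two texts' word-count vectors.

    The result lies in [0, 1]; values near 0 mean the texts share little
    vocabulary. Returns 0.0 if either text has no tokens.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets standing in for the same section in two filings.
previous = "We face competition from established firms in our markets."
current = "A regulatory investigation may materially affect our liquidity."
score = cosine_similarity(previous, current)
```

A score close to 1 would suggest the text was largely carried over, while a score close to 0 would suggest the section was substantially rewritten.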

Linguistic Analytics

The data set offers 56 fields that measure linguistic features for each section (Risk, MD&A, Legal, etc.) of a 10-K filing. These features include readability, textual complexity, language voice (active and passive), vocabulary variation, emphasis, and disclosure tone.


Non-Timely Filings

The data set records the date on which a firm files a Non-Timely notification with the SEC. Repeated inability to file annual or quarterly disclosures on time flags ongoing governance issues and raises questions about the general competence of the firm.
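One way to turn notification dates into a "repeated lateness" flag is to count notifications per firm within a lookback window. The sketch below is a minimal example of that idea. The record layout (firm identifier, notification date), the two-year window, and the threshold of two notifications are all assumptions chosen for illustration, not properties of the data set.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_repeat_late_filers(notifications, as_of, lookback_days=730, min_count=2):
    """Return the set of firms with at least `min_count` notifications
    dated within `lookback_days` before `as_of` (inclusive).

    `notifications` is an iterable of (firm_id, notification_date) pairs.
    """
    cutoff = as_of - timedelta(days=lookback_days)
    counts = defaultdict(int)
    for firm_id, filed_on in notifications:
        if cutoff <= filed_on <= as_of:
            counts[firm_id] += 1
    return {firm_id for firm_id, n in counts.items() if n >= min_count}


# Hypothetical records; the firm identifiers and dates are made up.
records = [
    ("FIRM_A", date(2022, 3, 31)),
    ("FIRM_A", date(2022, 8, 15)),
    ("FIRM_B", date(2022, 5, 10)),
    ("FIRM_C", date(2018, 3, 30)),
    ("FIRM_C", date(2022, 11, 14)),
]
flagged = flag_repeat_late_filers(records, as_of=date(2022, 12, 31))
```

In this example only FIRM_A is flagged: FIRM_B has a single notification, and one of FIRM_C's two notifications falls outside the window.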

SEC Filings Database

For every 10-K filed by firms reporting to the SEC, the text is cleaned and parsed into its constituent sections (Risk, Legal, MD&A, etc.). These sectional texts are stored and can be retrieved in real time once the filing has been made.
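To give a sense of what parsing a 10-K into sections involves, the sketch below splits already-cleaned plain text on "Item" headings and keeps Item 1A (Risk Factors), Item 3 (Legal Proceedings), and Item 7 (MD&A). It is a simplified illustration, not the database's actual parser. It assumes each heading starts its own line, and it ignores complications such as a table of contents that repeats the headings. The function and constant names are hypothetical.

```python
import re

# Standard 10-K item numbers mapped to the short section names used above.
SECTION_ITEMS = {"1A": "Risk", "3": "Legal", "7": "MD&A"}

# Matches headings such as "Item 1A." or "ITEM 7:" at the start of a line.
ITEM_HEADING = re.compile(
    r"^[ \t]*item[ \t]+(\d{1,2}[a-c]?)[ \t]*[.:]",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Return {section_name: section_text} for the items in SECTION_ITEMS.

    Each section runs from the end of its heading match to the start of
    the next item heading, or to the end of the text.
    """
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        item = match.group(1).upper()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        if item in SECTION_ITEMS:
            sections[SECTION_ITEMS[item]] = text[match.end():end].strip()
    return sections


# Hypothetical, heavily shortened filing text.
sample = (
    "Item 1A. Risk Factors\nWe face competition.\n"
    "Item 3. Legal Proceedings\nNone.\n"
    "Item 7. Management's Discussion\nRevenue grew.\n"
    "Item 8. Financial Statements\nSee attached.\n"
)
sections = split_sections(sample)
```

Here `sections["Legal"]` holds the text between the Item 3 heading and the Item 7 heading, and Item 8 is skipped because it is not listed in `SECTION_ITEMS`.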

CIK to Ticker Mapping

This CIK to Ticker mapper can be used as a stand-alone solution or as a complementary mapper that extends existing ticker coverage for SEC-related data sets.
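The complementary use can be sketched as a lookup that consults an existing mapping first and falls back to a second mapping only for CIKs the first does not cover. The dictionaries, tickers, and function names below are hypothetical. The only detail taken from SEC practice is that CIKs are commonly written as 10-digit, zero-padded strings, which the sketch uses as its key format.

```python
def normalize_cik(cik):
    """Convert an int or string CIK to a 10-digit zero-padded string."""
    return str(int(cik)).zfill(10)


def resolve_ticker(cik, primary, supplementary):
    """Look up a ticker in `primary`, falling back to `supplementary`.

    Both mappings are dicts keyed by normalized CIK. Returns None when
    neither mapping covers the CIK.
    """
    key = normalize_cik(cik)
    if key in primary:
        return primary[key]
    return supplementary.get(key)


# Hypothetical mappings; the CIKs and tickers are made up.
existing_coverage = {normalize_cik(1001): "AAA"}
extra_coverage = {normalize_cik(1001): "AAA.OLD", normalize_cik(2002): "BBB"}

ticker_a = resolve_ticker("0000001001", existing_coverage, extra_coverage)
ticker_b = resolve_ticker(2002, existing_coverage, extra_coverage)
ticker_c = resolve_ticker(3003, existing_coverage, extra_coverage)
```

Giving the existing mapping priority means the supplementary mapper only adds coverage and never overrides a ticker that is already known.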