Exploratory Parsing

We've built a methodology for mining streams based on the experimental construction of tolerant, high-speed, low-latency parsers. This is quite the opposite of how parser-generators are normally used.

Method

Given a corpus of semi-structured text, create progressively more refined parses of that text until all areas of interest have been explored.

  while still curious
    run a refined parse
    explore what does and does not match

Expect to make small adjustments and to learn from each one, at a pace that yields new insight every few minutes.
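The loop above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling described on this page: regular expressions stand in for generated parsers, and the rule names, the sample corpus, and the helpers `classify` and `run_parse` are all hypothetical. Rules are tried in order, the first match wins, and a catch-all keeps the parse tolerant so that unmatched text is recorded rather than rejected.

```python
import re
from collections import Counter

# Hypothetical rules for a made-up log format. Order matters: first match wins.
RULES = [
    ("timestamped", re.compile(r"\d{4}-\d{2}-\d{2} .+")),
    ("key_value", re.compile(r"\w+=\S+")),
    ("blank", re.compile(r"\s*")),
]

def classify(line, rules):
    """Return the name of the first rule that matches the whole line."""
    for name, pattern in rules:
        if pattern.fullmatch(line):
            return name
    return "other"  # tolerant catch-all: never fail, just record the miss

def run_parse(corpus, rules):
    """One experiment: tally matches per rule and collect unmatched lines."""
    counts = Counter()
    unmatched = []
    for line in corpus:
        name = classify(line, rules)
        counts[name] += 1
        if name == "other":
            unmatched.append(line)
    return counts, unmatched

# A tiny invented corpus, purely for illustration.
corpus = [
    "2024-01-05 service started",
    "user=alice",
    "",
    "WARN disk nearly full",
]

# First experiment: one line falls through to the catch-all.
counts, unmatched = run_parse(corpus, RULES)

# Looking at `unmatched` suggests a refinement: add a rule for severity lines.
severity = ("severity", re.compile(r"(WARN|ERROR) .+"))
refined = RULES[:2] + [severity] + RULES[2:]
counts2, unmatched2 = run_parse(corpus, refined)
```

Each pass through `run_parse` is one turn of the loop: read the tallies, look at what fell through, adjust a rule, and run again.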

Tools

PEG Grammars let us describe both what we expect and what we don't. The PEG approach to alternatives, which tries them in order and takes the first that matches, is simpler and better suited to this purpose than the alternation of traditional grammars.

PegLeg Generator creates optimized C programs that carefully manage text as one efficient blob.

Regrade Tally that selectively keeps evenly spaced samples of every match.

Live Visualization shows structure and match counts as they emerge from the experiment.

Representative Subsets of a corpus constructed such that all extant cases in the grammar are present.

Parse Run Server hosted as a web application with drill-down and history.

Hover Diff between each experiment in the exploration.
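One way to keep evenly spaced samples of every match, as the tally above is described as doing, is to thin the samples whenever a fixed budget fills up. The sketch below is an assumption about how such a tally could work, not the actual implementation: the class `SampleTally` and its `capacity` parameter are hypothetical names. It counts every match, keeps every `stride`-th one, and doubles the stride each time the sample list is full.

```python
class SampleTally:
    """Count every match but keep at most `capacity` evenly spaced samples."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.count = 0     # total matches seen so far
        self.stride = 1    # keep every `stride`-th match
        self.samples = []

    def add(self, text):
        if self.count % self.stride == 0:
            if len(self.samples) == self.capacity:
                # Full: keep every other sample and double the spacing.
                self.samples = self.samples[::2]
                self.stride *= 2
            # Re-check, since the stride may just have doubled.
            if self.count % self.stride == 0:
                self.samples.append(text)
        self.count += 1

# Feed ten matches, using the match index itself as the sample text.
tally = SampleTally(capacity=4)
for i in range(10):
    tally.add(i)
```

The retained samples always sit at multiples of the current stride, so they stay spread across the whole run instead of clustering at the start, while memory use stays bounded however many matches arrive.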

Resources

Talks and Tutorials explaining our method.

Source Code Repos released as open-source.