Our challenge is to understand what each run of the evolving parser finds. We record matches as they are found but are prepared to discard most of these should we find too many. github
Discard
Consider how we would store 45 matches if we had only room for 10. We store every one until we overflow with number 11. We discard half and continue storing every other one until we overflow again, discard half and continue storing every fourth, then every eighth.
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, discard half 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, discard half 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, discard half 1, 9, 17, 25, 33, 41
Using our regrade algorithm we would end up with six stored matches evenly distributed among the matches that occurred.
Allocation
To keep the program moving along we store only the byte offset of each match knowing that we can randomly seek through our corpus to find the specific text should ever become interested.
We record matches under two keys: the grammar rule name and the calling grammar rule name. This is how we capture the tree structure of the parse. We preallocate room for 1,000 name pairs and 10 regrade matches within each of these. 10,000 matches total.
Viewing
We keep an accurate count of how many caller-callee pairs have been tallied independently of the regrade discards. These label the arcs in our visualized parse tree.
If the count is ten or less, we know we can retrieve every match. Up to twenty, at least half.
A common practice is to subdivide categories of interest with additional rules until there are ten or fewer cases unaccounted for.