Validation and Normalization of Index Information

While I was doing my daily walk this morning I was pondering the index loading code that I wrote yesterday and what else had to be done. Ostensibly the main thing was to add support for loading the table of contents specified in the index file, but the more I thought about it the more I realized that I was missing some things.

In particular, although I had some validation in place to ensure that some aspects of the index were correctly formatted (is the topic list an array? is the first item a string?), a bunch of validation was still missing. What if someone made the top level structure of the index an array, for example? I pondered how many checks would be needed and how to keep that code as short as possible, and then decided that something else was in order.

I went searching for some existing Python libraries that would be capable of taking a native object and verifying that it follows a particular schema. Here I’m less interested in things like “and this key has to be a number, and it has to be larger than 6 but smaller than 15” and more “does the structure of this data make sense?”.

A major concern here was finding a library that was fairly small, because I planned to inject whatever I found directly into the hyperhelp package itself, so that everything needed is self-contained. A large part of this is that I don’t know if Package Control allows a dependency to depend on another dependency, and I don’t want people who use my dependency to have to remember that they also need to depend on something else.

After a bit of searching around, I landed on validictory, which seems to fit my needs almost perfectly. It uses what appears to be a large subset of the standard JSON schema format and nicely tells you if your data matches or not. Better still, it’s implemented as only a couple of Python source files and was easy to modify to work from within a Sublime Text package.

The code now has a complete schema that applies to the help index file, so that before we bother looking at any data from within it we can tell if the structure is right or not and completely bail if it’s not. This has some great knock-on effects in the rest of the loading code, which can happily assume that everything is as we expect it to be.
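
To give a rough idea of what "validate the structure first, then trust it" looks like, here is a minimal sketch. It is not the hyperhelp code and it does not use validictory; `check_structure` is a tiny stand-in checker written just for this illustration, and the key names in `INDEX_SCHEMA` (`package`, `topics`, `toc`) are assumptions rather than the real index layout.

```python
# Hypothetical schema for an index file, written in a JSON-schema-like style.
# The key names are illustrative assumptions, not the actual index format.
INDEX_SCHEMA = {
    "type": "object",
    "properties": {
        "package": {"type": "string", "required": True},
        "topics": {"type": "array", "required": True},
        "toc": {"type": "array"},
    },
}

# Maps schema type names to the Python types they correspond to.
TYPE_MAP = {"object": dict, "array": list, "string": str}


def check_structure(data, schema, path="index"):
    """Return a list of error strings; an empty list means the data matches."""
    expected = schema.get("type")
    if expected is not None and not isinstance(data, TYPE_MAP[expected]):
        # Wrong type at this level, so there is no point in looking deeper.
        return ["%s: expected %s" % (path, expected)]

    errors = []
    if isinstance(data, dict):
        for key, sub in schema.get("properties", {}).items():
            if key in data:
                child = "%s.%s" % (path, key)
                errors.extend(check_structure(data[key], sub, child))
            elif sub.get("required", False):
                errors.append("%s: missing key '%s'" % (path, key))
    elif isinstance(data, list) and "items" in schema:
        for i, item in enumerate(data):
            child = "%s[%d]" % (path, i)
            errors.extend(check_structure(item, schema["items"], child))
    return errors
```

A loader built on this would call `check_structure(index_data, INDEX_SCHEMA)` once and bail out if the returned list is not empty; everything after that point can index into the data without defensive checks.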

A small downside of this is that the code is now much longer with a few object literals in it that describe the structure of the data. On the other hand, having that sitting right inside the loading code looks like a great reminder of exactly how the file is supposed to be laid out, which my maintenance programmer side sees as an enormous win.

Something else to consider is that the author of validictory is officially deprecating it as of the start of next year. I don’t think that’s a big deal for our needs here, since the data set that it’s going to be exposed to is limited anyway. Additionally, that saves on any future maintenance involved in having to upgrade it to a newer version, or having to repackage it in another fashion.

Another downside I’ve found so far is that its error messages are not very useful. In some cases it can tell you that a top level key is invalid or missing, while in others it includes the entire object literal in the error message, making it virtually useless. That’s not of any great concern at the moment, since the goal is just to ensure that the data is correct. The fact that it uses a schema format similar to that of other validation libraries means that it could fairly easily be swapped for a drop-in replacement in the future.

Along with the validation code, the validation and expansion of the table of contents is now fully complete, which means that (barring any unforeseen bugs or overlooked data items) everything is set to be able to collect help information for all packages and to navigate help in a more robust fashion than is currently happening.

The table of contents code benefits greatly from being able to assume that the data structure is valid before starting, since it needs to do a couple of things. Firstly, any entry that’s just a string is expanded to the full topic, so that the contents is more compact internally. Secondly, it has to be able to pull the associated topic caption from the main topic if the entry in the table of contents doesn’t have one, just to make life easier.

The big one is recursively applying the same rules to any child elements that might appear in the table of contents so that in the end the entire structure is fully expanded to proper topics and is ready to go. I chose an iterative design method for this, starting out with something large and obnoxious that did what I wanted and then slowly cleaning it up in bits and pieces to get it smaller and cleaner.
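
The expansion rules described above can be sketched roughly like this. Again, this is an illustration and not the actual hyperhelp code: the function name `expand_toc`, the `topic`, `caption` and `children` keys, and the idea that `topics` is a dictionary keyed by topic name are all assumptions made for the example. It also assumes every entry refers to a topic that exists.

```python
def expand_toc(entries, topics):
    """Return a fully expanded copy of a table of contents.

    entries: list whose items are either a topic name (str) or a dict
             with a "topic" key and optional "caption" and "children".
    topics:  dict mapping topic name -> full topic dict (assumed layout).
    """
    expanded = []
    for entry in entries:
        # Rule 1: a bare string is shorthand for an entry naming that topic.
        if isinstance(entry, str):
            entry = {"topic": entry}

        # Rule 2: start from a copy of the main topic so that anything the
        # entry leaves out (such as the caption) is inherited from it, then
        # let the entry's own values take priority.
        full = dict(topics[entry["topic"]])
        full.update(entry)

        # Rule 3: apply the same rules recursively to any child entries.
        if "children" in full:
            full["children"] = expand_toc(full["children"], topics)

        expanded.append(full)
    return expanded
```

Because each entry is rebuilt from a copy, the original list loaded from the index file is left untouched, which makes it easier to compare the raw and expanded forms while debugging.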

Overall I’m happy with the progress made today, even though I didn’t get to actually using the index data for anything interesting. Even so, I’m glad that I spent some time on the validation aspect, since that seems like it will be a big win going forward, not only for this project but also for any others that I might work on down the road.

Tomorrow we finally get to the loading of all available help indexes and starting on the core code that will allow you to navigate the help system in a more complete manner.