Posts tagged html
Jon and I were chatting the other day about a course he had recently attended. It had covered the common types of attacks against web based systems and good practice to defend against them. I was relieved that the results of the course could be summed up my existing knowledge:
Validate your inputs and validate your outputs
Anything coming into the system needs to be validated and anything leaving the system needs to be validated. With the LORLS v6 having a back-end system and multiple front-end systems things are a little more difficult. One front-end system may have a requirement to allow one set of HTML tags while another front-end system needed to not display some of those tags.
This lead us to the conclusion that the the back-end should make sure that it isn’t vulnerable to SQL Injection attacks and the front-ends should make sure it isn’t vulnerable to the XSS style of attacks.
This left me looking at CLUMP and trying to figure out what HTML tags should be allowed. After thinking about it for a while I came to the conclusion that this will need to be configurable as I was bound to miss one that would break an imported reading list. I also realised that, and that it will go deeper than tags, what attributes will each tag allow (we don’t really want to support the onclick type attributes).
The final solution we decided on is based around a configurable white-list. This lets us state which tags are accepted and which are dropped. For those accepted tags we can also define what attributes are allowed and provide a regular expression to validate that attributes value. If there is no regular expression to validate the attribute then the attribute will be allowed but without any value, e.g. the noshade attribute of the hr tag.
Getting the tag part working was easy enough, the problem came when trying to figure out what the attributes for each tag in the metadata were. After initially thinking about regular expressions and splitting strings on spaces and other characters I realized that it would be a lot easier and saner to write a routine to process the tags attributes one character at a time building up attributes and their values. I could then handle those attributes that have strings as values (e.g. alt, title, etc.).
As a test I put in altered an items author to contain
The a tag is currently allowed and the href and alt attributes are also allowed. The alt validation pattern is set to only allow alpha numeric and white-space characters while the href validation pattern requires it to start with http:// or https://. This is how CLUMP generates the a tag for the test entry.
The onclick attribute isn’t valid an so has been dropped, the href attribute didn’t start with http:// or https:// so has also been droped. The alt attribute on the other hand matches the validation pattern and so has been included.