A couple of days ago I tried to modify an HTML page using a DOM. When I tried to convert the page into a Document instance, the parser threw SAXExceptions complaining about the structure of the document, along the lines of “this tag needs to be closed”. The source was HTML output generated from DocBook. I had neither the time nor the intention to clean up the generated HTML by hand, but I remembered there was something called Tidy. So I searched for a Java library, and there it was: JTidy. It looks like an unmaintained project, but it is the right tool to clean up an HTML page and transform it into valid XHTML. The API is pretty straightforward.
This is the code for converting (invalid) HTML to a Document instance:
import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// Create instance
final Tidy tidy = new Tidy();
// Remove presentational clutter (replaces presentational tags
// and attributes such as <font> with style rules)
tidy.setMakeClean( true );
// Use XHTML output
tidy.setXHTML( true );
// Make document readable by indenting the elements
tidy.setSmartIndent( true );
// The html document received by a get request
final String s = ...;
// Converting the page into a Document instance
// (note: getBytes() uses the platform default charset)
final Document document = tidy.parseDOM(
    new ByteArrayInputStream( s.getBytes() ) , null );
That’s it. You now have your HTML as a Document instance that you can freely manipulate.
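The only thing I noticed was that the method node.setTextContent() does not work, presumably because it is a DOM Level 3 method that JTidy's DOM implementation does not support. But you can use node.appendChild( document.createTextNode( ... ) ) instead, which does what you want.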
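If you want the full effect of setTextContent(), that is, replacing whatever the node already contains rather than appending to it, a small helper built only from older DOM calls might look like the sketch below. The class and method names are made up for illustration, and the demo runs against the JDK's built-in DOM. I am assuming removeChild() behaves the same way on JTidy's nodes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of JTidy
public class TextContentFallback {

    // Replaces all children of the node with a single text node,
    // using only removeChild, createTextNode and appendChild
    public static void replaceText( Document document , Node node , String text ) {
        while ( node.getFirstChild() != null ) {
            node.removeChild( node.getFirstChild() );
        }
        node.appendChild( document.createTextNode( text ) );
    }

    public static void main( String[] args ) throws Exception {
        // Demo on the JDK's own DOM implementation (assumption: a JTidy
        // Document can be passed in the same way)
        final Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        final Element p = document.createElement( "p" );
        p.appendChild( document.createTextNode( "old text" ) );

        replaceText( document , p , "new text" );
        System.out.println( p.getFirstChild().getNodeValue() ); // prints "new text"
    }
}
```

If you only ever fill empty nodes, the plain appendChild call is enough and the loop is unnecessary.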
The second part is about writing your Document to a string:
import java.io.ByteArrayOutputStream;

// Create a stream to write the output to
final ByteArrayOutputStream outStr2 = new ByteArrayOutputStream();
// Write modified Document to an output stream
tidy.pprint( document , outStr2 );
// Decode the bytes written to the stream into a String
// (assumes Tidy's output encoding is UTF-8)
final String validXHTML = new String( outStr2.toByteArray() , "UTF-8" );
At the end of the block you have your valid XHTML in a String.
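To double-check the result, you could feed the string back to the standard JAXP parser, the kind of parser that threw the SAXExceptions in the first place. The following is only a sketch: the class and method names are made up, it tests well-formedness rather than XHTML validity, and the demo uses small hand-written strings without a DOCTYPE. Real Tidy output carries a DOCTYPE, and the parser may try to load the referenced DTD, which this sketch does not handle.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JTidy
public class WellFormedCheck {

    // Returns true if the standard parser accepts the string as well-formed XML
    public static boolean isWellFormed( String xml ) {
        try {
            final DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them
            builder.setErrorHandler( new DefaultHandler() );
            builder.parse( new InputSource( new StringReader( xml ) ) );
            return true;
        } catch ( SAXException e ) {
            // The parser complained about the structure of the document
            return false;
        } catch ( ParserConfigurationException | IOException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String[] args ) {
        // Closed tags are accepted
        System.out.println( isWellFormed( "<html><body><p>ok</p></body></html>" ) );
        // An unclosed <p> is rejected
        System.out.println( isWellFormed( "<html><body><p>broken</body></html>" ) );
    }
}
```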