Photo of "Wird" by Marian Kroell on Unsplash

Word Processing in HTML

I know a lot of people who hate word processors. As web developers, we know how to structure a web page and how to apply cascading styles effectively, so why can’t we ditch the word processor and simply use HTML?

With the power of HTML, CSS3, and some export libraries, we can do word processing by hand in a format much more convenient and familiar to us. We no longer have to sit there at the mercy of our word processing application, hoping that it interprets what we meant correctly and then fiddling with it until it does.

Advantages

Freedom! Tossing your word processing application out the window, you now have all the power, flexibility, and structure of HTML at your fingertips. You are in total control of the document structure and the style design/inheritance. This gives you the power to make your documents accessible to viewers with visual impairments and use responsive design to allow optimal viewing on all computing devices. You can also mix in multimedia and JavaScript to enhance your documents with interactive features that no word processor could even dream of.

Here are some more advantages:

Universal ability to view the document; all modern computers have a web browsing application.
Choice of editor; if it can type characters in the appropriate character encoding, it’ll let you modify your document.
Good integration with version control systems and other text-optimized utilities like diff.
Small document size.

Disadvantages

While there are many amazing things that you can do with HTML, a lot of these are relatively new to HTML and CSS. The main issue is the concept of a “page.” Web pages, despite their name, have no idea what a page is; they are just an infinitely tall and variably wide plane on which elements are laid out. CSS3 introduces some styles and functions for handling pages, but these haven’t found their way into most web browsers yet.

Another issue you’ll encounter is transporting documents that have local media files attached. This can be somewhat mitigated by embedding as much as possible in the HTML document, but multimedia (especially images) and massive JavaScript libraries are best left outside the document. Putting the document in an archive format such as TAR or ZIP solves the issue, but adds a step for the end user.
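As a rough sketch of the embedding approach, the snippet below keeps a small stylesheet and a simple vector graphic inside the document itself, so there is nothing extra to ship. The class name and the drawing are made up for illustration; larger images and libraries would still live in separate files.

```html
<!DOCTYPE html>
<html lang="en">
<head>
   <meta charset="utf-8">
   <title>Self-contained document</title>
   <style>
      /* Styles live in the document instead of an external .css file */
      .logo { width: 4em; height: 4em; }
   </style>
</head>
<body>
   <!-- Inline SVG: the graphic travels with the document, no image file needed -->
   <svg class="logo" viewBox="0 0 100 100" role="img" aria-label="Placeholder logo">
      <circle cx="50" cy="50" r="40" fill="steelblue" />
   </svg>
   <p>Everything this page needs is in this one file.</p>
</body>
</html>
```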

Some elements in HTML are extremely verbose relative to the amount of content they contain. One such example would be tables. One solution for this would be using an intermediate format such as Markdown to get around the tedious typing of this HTML.

One of the final disadvantages I can think of is a difficiency in most programers and programming editers: grammer and speling. Thankfully, plugins do exist to help us out here.

Getting Started

The first thing you’ll want to do is create an empty document shell using whatever variant of HTML you wish. You can add in title and meta fields to give the document some useful metadata such as a title, author, etc. The structure of the body of the document is entirely up to you as well.
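A minimal shell might look like the sketch below. It assumes HTML5; the title, author, description, and section names are placeholders rather than anything prescribed.

```html
<!DOCTYPE html>
<html lang="en">
<head>
   <meta charset="utf-8">
   <!-- Document metadata: placeholder values, replace with your own -->
   <title>Document Title</title>
   <meta name="author" content="Your Name">
   <meta name="description" content="One-line summary of the document">
</head>
<body>
   <h1>Document Title</h1>
   <!-- The body structure is up to you; this is one possible layout -->
   <div class="section">
      <h2>Introduction</h2>
      <p>First paragraph goes here.</p>
   </div>
</body>
</html>
```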

Stylesheet

You’ll have to decide what works best for referencing external dependencies. Embedding the stylesheet in the document has the benefit of encapsulating everything in one file, but it can become harder to manage as the document grows, and it makes the stylesheet harder to reuse. I found it best not to share the same stylesheet file among multiple documents, because editing it for a new document can break an older one.

You’ll want to split up the stylesheet by media type. First, you’ll have a base stylesheet that gets applied to everything; then I typically have one for screen media and another for print media. Most of your styles should go in the base stylesheet.
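One way to lay this out in a single embedded stylesheet is sketched below, using @media blocks for the screen and print (paged) cases. The fonts, sizes, and the screen-only class are illustrative assumptions, not recommendations.

```html
<style>
   /* Base styles: applied to every medium */
   body { font-family: Georgia, serif; line-height: 1.5; }
   h1, h2, h3 { font-family: Helvetica, Arial, sans-serif; }

   /* Screen only: a comfortable reading column */
   @media screen {
      body { max-width: 40em; margin: 0 auto; padding: 1em; }
   }

   /* Print only: physical units, no link decoration, hide screen-only extras */
   @media print {
      body { font-size: 12pt; }
      a { color: inherit; text-decoration: none; }
      .screen-only { display: none; }
   }
</style>
```

Keeping most rules in the base block means the screen and print blocks only hold the differences.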

Common Use Cases and Workarounds

Writing the rest of the document should now be familiar since it’s as simple as a web page. However, here are a bunch of use cases that I’ve run into and my solutions to them.

Page Breaking

HTML is designed to define document structure, not page layout. This creates a challenge when working with pages, which sometimes require custom typesetting. There is a limit to how much custom typesetting you can do with HTML + CSS, but among the most useful tools are the page-break-* CSS properties. These give you some control over how an element is split across pages.

For example, if you’re working with a section that has a header and a paragraph, you typically don’t want the header to print at the bottom of page 1 and the paragraph to start at the top of page 2. To solve this issue, you can wrap the text that should stay together.

.keep-together { page-break-inside: avoid; }
…
<div class="section">
   <div class="keep-together">
      <h3>Section 2</h3>
      <p>
         The quick brown fox jumps over the lazy dog.
      </p>
   </div>
   <p>
      More of section 2 will go here…
   </p>
</div>

In the above example, you could also solve this by forcing a section to start on a new page. Sometimes this is desirable depending on how you want the page to print out. To do this, you would just add some CSS code.

.section { page-break-before: always; }

Another method of resolving this issue is to use the orphans and widows styles. orphans defines the minimum number of lines of an element that must fit at the bottom of a page. If the browser cannot fit that many lines there, it moves them to the next page.

widows does the same thing, but for the minimum number of lines at the top of the page.
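A small sketch of how these two properties might be set for printed paragraphs follows. The value 3 is an arbitrary choice for illustration, and support differs between browsers and PDF tools, so it is worth testing in whichever renderer you use.

```html
<style>
   @media print {
      p {
         orphans: 3; /* keep at least 3 lines of a paragraph at the bottom of a page */
         widows: 3;  /* carry at least 3 lines of a paragraph to the top of the next page */
      }
      /* Optional companion rule: avoid a page break right after a heading */
      h3 { page-break-after: avoid; }
   }
</style>
```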

Page Headers and Footers

One of the most page-specific things you’ll have to deal with is adding page numbers to your document. This is also the grayest area of HTML page support because it is relatively new and also conflicts with something that browsers already try to do when printing. It is mostly useful when you are using a PDF conversion tool.

CSS counters allow you to define variables that can be incremented and then displayed, using counter-increment and content respectively. These can be used for a number of things such as figure, section, and page numbers. For page numbers, there is a predefined page counter that can be used in browsers and PDF conversion tools that support it; you cannot increment this counter manually.
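To make the counter mechanics concrete, here is a sketch that numbers figure captions automatically. The counter name figure, the figure-caption class, and the image file name are all invented for this example.

```html
<style>
   /* Start the figure counter at 0 for the whole document */
   body { counter-reset: figure; }

   /* Each caption bumps the counter by one... */
   .figure-caption { counter-increment: figure; }

   /* ...and displays its current value in front of the caption text */
   .figure-caption:before { content: "Figure " counter(figure) ": "; }
</style>

<div class="figure">
   <img src="chart.png" alt="Bar chart of quarterly sales">
   <p class="figure-caption">Sales by quarter</p> <!-- shown as "Figure 1: Sales by quarter" -->
</div>
```

Adding or reordering figures then renumbers the captions without any manual edits.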

Below is CSS that can be used with one of the PDF converters mentioned below to add page numbers to the bottom of the page.

@page
{
   /* bottom-center margin box, the standard CSS Paged Media name */
   @bottom-center
   {
      content: counter(page);
   }
}

This is not very well supported; some tools will have their own methods of defining page headers and footers that I’ll mention in the PDF Exporting section.

Table of Contents

A table of contents is great on both screen and page media. However, there are some subtle differences in usability. On a screen, a link to an ID is the best way to get a user to a section. On a page though, links mean nothing. Instead, we must give the user the page number that the section is on. Therefore, we face two challenges.

First, we must get the page number that a section is on. We can get this with the target-counter CSS3 function. As far as I know, the only tool that supports this is PrinceXML. Below is an example of using this.

.introduction-nav:after { content: 'Page ' target-counter('#section-introduction', page, decimal); }
…
<ol>
   <li class="introduction-nav"><a href="#section-introduction">Introduction</a></li>
   <li class="dependencies-nav"><a href="#section-dependencies">Dependencies</a></li>
   …
</ol>
<div id="section-introduction">
   …
</div>

Second is the addition of section numbers or letters. This can again be done automatically with CSS counters. Using the HTML from the example above, we can add the following CSS to insert section letters:

ol
{
   counter-reset: section;
   list-style-type: none;
}
li:before
{
   counter-increment: section;
   content: counter(section, lower-alpha);
}

If you want to implement nested numbering (e.g., a list within a list having the number 1.1), you can use the counters CSS function:

ol
{
   counter-reset: section;
   list-style-type: none;
}
li:before
{
   counter-increment: section;
   content: counters(section, ".");
}
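To see what the counters rule produces, here is a sketch that applies it to a nested list. The rule is repeated so the snippet stands alone, with a trailing space added after the number for readability; the entries other than Introduction and Dependencies are invented for the example.

```html
<style>
   /* Every ol starts its own "section" counter, so nested lists get nested scopes */
   ol { counter-reset: section; list-style-type: none; }

   /* counters() joins the values of all enclosing "section" counters with "." */
   li:before { counter-increment: section; content: counters(section, ".") " "; }
</style>

<ol>
   <li>Introduction</li>      <!-- rendered as "1 Introduction" -->
   <li>Dependencies           <!-- rendered as "2 Dependencies" -->
      <ol>
         <li>Libraries</li>   <!-- rendered as "2.1 Libraries" -->
         <li>Tools</li>       <!-- rendered as "2.2 Tools" -->
      </ol>
   </li>
   <li>Usage</li>             <!-- rendered as "3 Usage" -->
</ol>
```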

Sometimes, tools (like the PDF export tools mentioned next) will have table of contents and header/footer features built in, so that you don’t have to deal with all these issues with unsupported CSS styles.

PDF Exporting

People often like distributing their documents as PDFs, since they know the document will show up exactly the way they see it on their screen. To accomplish this with the HTML and CSS we have created, we have to go beyond a web browser’s print function.

A very solid, well-supported solution is PrinceXML. While it is free for personal use, it is relatively expensive for commercial use. A free alternative is wkhtmltopdf, which uses the WebKit rendering engine to convert HTML to PDF.

My Personal Experience

I’ve been using HTML to do word processing for a few years now. Most of that experience has been writing papers for courses when I was in college. Luckily, I’ve been able to dodge having to insert a table of contents in my documents, and I’ve been able to use the browser’s print functionality to add page numbers.

It worked out great and saved tons of time when I had to write many documents that all followed the same template. I was able to make them professional-looking, with consistent styles across the many documents. The biggest reason I did it was that it gave me a chance to play around with HTML and CSS while performing otherwise tedious tasks.
