
Photo by Joshua Sukoff on Unsplash

FileReader Chunking and Base64 DataURLs

by Christopher Keefer

In a hurry? You can now use our HUp jQuery plugin to read files in a chunked fashion as data URLs. Hooray!

Got a minute or two? Let’s talk about file read chunking, data URLs and base 64.

If you’ve been looking forward to the previously promised discussion about file reading/downloading to/uploading from IndexedDB, well, keep looking forward, it’s on the way. In the meantime, however, let’s take a quick look at a problem, and its quick and easy solution, that emerged out of making file reading chunkable for the HUp plugin.

The Problem

The HUp plugin has, as one of its goals, sensible defaults – such that a user can just call it against whatever element they want to make into a file reader/uploader drag-and-drop point, and presto! Everything works, and in a reasonable manner.

When I added the ability to read files in chunks, to mirror and complement our ability to upload files in chunks, we hit a snag with regard to the above goal. Generally speaking, developers using the plugin to read files will probably want to use those files in the browser in some fashion, and one of the simplest representations that can be used directly in the browser is a base64-encoded data URL. You can pop it into the src attribute of a number of elements in a modern browser and do something with it right away. So, defaulting our read_method to readAsDataURL makes sense.

However! Now that we’re chunking file reads by default (since it offers benefits, some of which were discussed last time), if the file to be read is larger than our chunk_size (which defaults to 1MiB), then it will be read as a number of chunks, which will need to be reassembled to be used in a src attribute as a data URL. Small bit of extra work for the developer, but a little string manipulation and you’re done, right?

Not quite.

It’s common knowledge that encoding binary data as base64 will result in an overall increase in size of about 33%. This is because, with a total of 2⁶ = 64 different values, each base64 character can carry only 6 bits of the source binary, even though the character itself occupies eight bits (in, for example, utf-8). Every 3 bytes (24 bits) of source therefore become 4 characters of base64.
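To put numbers on that, here is a small sketch. The base64Length helper is a name made up for this illustration and is not part of HUp; btoa is the standard encoder available in browsers and in recent versions of Node.

```javascript
// Length of the (padded) base64 encoding of `byteCount` bytes:
// every group of up to 3 source bytes becomes 4 output characters.
function base64Length(byteCount) {
    return Math.ceil(byteCount / 3) * 4;
}

// Three source bytes (the ASCII codes for "Man") become four characters.
var bytes = [77, 97, 110];
var encoded = btoa(String.fromCharCode.apply(null, bytes));

console.log(encoded);               // "TWFu"
console.log(base64Length(1048576)); // 1398104, roughly 1.33x the 1MiB source
```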

So what’s the problem? We can end up with chunks that can’t be trivially recombined, and the details we just discussed explain why. Base64 works on groups of 3 bytes, turning each group into 4 characters. If we’re not careful about where we slice the file, a chunk can end partway through a group; its encoding is then padded with = characters, and every base64 chunk after it is out of alignment with the binary source. Simply concatenating the chunks and using the result will fail.
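The misalignment is easy to reproduce without touching FileReader at all. The sketch below uses a plain array of 20 bytes as a stand-in for a file, and the helper names (toBase64, encodeInChunks) are invented for the example rather than taken from HUp.

```javascript
// Encode an array of byte values (0-255) as base64.
function toBase64(bytes) {
    return btoa(String.fromCharCode.apply(null, bytes));
}

// Slice `bytes` into pieces of `chunkSize` and base64-encode each piece separately.
function encodeInChunks(bytes, chunkSize) {
    var parts = [], i;
    for (i = 0; i < bytes.length; i += chunkSize) {
        parts.push(toBase64(bytes.slice(i, i + chunkSize)));
    }
    return parts;
}

// A stand-in for file contents: 20 arbitrary bytes.
var source = [];
for (var n = 0; n < 20; n++) {
    source.push(n * 7);
}

var whole = toBase64(source);
var misaligned = encodeInChunks(source, 4).join(''); // 4 is not a multiple of 3
var aligned = encodeInChunks(source, 6).join('');    // 6 is a multiple of 3

console.log(misaligned === whole); // false: padding ends up in the middle of the string
console.log(aligned === whole);    // true
```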

The Workaround

Recently, a user of the plugin brought up an unrelated problem, regarding the MIME type on the produced data URLs. In the process of fixing this, I started thinking about how chunking might affect the data URLs, and a few quick experiments confirmed my suspicion: chunking the files and encoding them as base64 caused problems. I didn’t have the time to fix the issue at that moment, however, so I mentioned a workaround to said user.

You, lucky reader, don’t need this workaround, since this issue is now addressed – however, it might be of future interest, particularly if you need/prefer to work with Array Buffers.

Since using an Array Buffer means we don’t need to worry about base64 encoding a file, we can slice and dice each chunk however we please, and concatenating the resulting array buffers poses no issue. So, instead of using readAsDataURL as our read_method, I suggested this user employ the readAsArrayBuffer method instead.

What if the user wants to be able to display or otherwise use the read-in file in the browser? Well, we could convert the concatenated array buffers into base64, but a much more straightforward method is to create a new Blob from the array buffers, and get an object URL from said blob to display.

This would look something like the following, assuming for the sake of example that you have three array buffers which we’ll creatively name arr1, arr2 and arr3:

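// The Blob constructor concatenates the buffers in order; swap 'mime/type' for the file's real MIME type.
var blob = new Blob([arr1, arr2, arr3], {type:'mime/type'}),
    url = URL.createObjectURL(blob);
// url can now go into a src attribute; call URL.revokeObjectURL(url) once you no longer need it.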

Pretty easy, right?
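If you’d rather end up with a single ArrayBuffer than go through a Blob, the chunks can also be joined by hand. A minimal sketch follows; concatArrayBuffers is a hypothetical helper written for this post, not something HUp provides.

```javascript
// Join a list of ArrayBuffers, in order, into one new ArrayBuffer.
function concatArrayBuffers(buffers) {
    var total = 0, offset = 0, i, joined;
    for (i = 0; i < buffers.length; i++) {
        total += buffers[i].byteLength;
    }
    joined = new Uint8Array(total);
    for (i = 0; i < buffers.length; i++) {
        // Copy this chunk's bytes in at the current write position.
        joined.set(new Uint8Array(buffers[i]), offset);
        offset += buffers[i].byteLength;
    }
    return joined.buffer;
}

// Usage with two tiny stand-in chunks:
var first = new Uint8Array([1, 2]).buffer;
var second = new Uint8Array([3]).buffer;
var combined = concatArrayBuffers([first, second]);
console.log(combined.byteLength); // 3
```

Because no base64 is involved, the chunk boundaries can fall anywhere.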

The Solution

I wasn’t entirely happy with having to suggest this workaround, however, so when I had an hour or so free to revisit the question, I tried a little experiment. Could it really be as simple, I wondered, as making sure our chunk sizes were aligned to the nearest multiple of 6 (remembering, as mentioned above, that we get 6 bits per character)?

Yes, yes it could be.

So, the defaults for HUp can now remain blissfully unchanged, and chunking works just fine. When readAsDataURL is specified as the read_method, the plugin will transparently alter the total size of each chunk to the nearest multiple of 6 until we reach the end of the file, ensuring that the resulting base64 remains aligned with the binary source, and allowing trivial recombination of the data URLs. (Strictly speaking, any chunk size that is a multiple of 3 bytes stays aligned, since base64 turns every 3 bytes into 4 characters; a multiple of 6 is always a multiple of 3, which is why this works.)
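The adjustment itself is only a couple of lines. What follows is a sketch of the idea and not HUp’s actual code: alignChunkSize and chunkRanges are made-up names, and rounding to the nearest multiple (rather than always up or always down) is an assumption based on the description above.

```javascript
// Round a chunk size to the nearest multiple of `multiple` (default 6),
// never going below one full multiple.
function alignChunkSize(chunkSize, multiple) {
    multiple = multiple || 6;
    var aligned = Math.round(chunkSize / multiple) * multiple;
    return Math.max(aligned, multiple);
}

// Work out the [start, end) byte ranges to slice from a file of `fileSize` bytes.
// Only the last range may be shorter than the aligned chunk size.
function chunkRanges(fileSize, chunkSize) {
    var size = alignChunkSize(chunkSize), ranges = [], start;
    for (start = 0; start < fileSize; start += size) {
        ranges.push([start, Math.min(start + size, fileSize)]);
    }
    return ranges;
}

console.log(alignChunkSize(1048576)); // 1048578: the 1MiB default nudged up by 2 bytes
console.log(chunkRanges(20, 7));      // [[0, 6], [6, 12], [12, 18], [18, 20]]
```

Each range could then be handed to file.slice(start, end) and read with readAsDataURL.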

I’ve also added a small convenience function to handle said recombination; check out the GitHub repo for the details.
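For a rough idea of what such recombination involves, here is a sketch. It is not the function from the repo: joinDataURLs is a name invented here, and it assumes every chunk is a base64 data URL produced from an aligned slice, with the first chunk’s header carrying the MIME type you want.

```javascript
// Join base64 data URL chunks: keep the first chunk whole (header included)
// and append only the base64 payload of each later chunk.
function joinDataURLs(urls) {
    var marker = ';base64,', joined = '', i, idx;
    for (i = 0; i < urls.length; i++) {
        idx = urls[i].indexOf(marker);
        if (idx === -1) {
            throw new Error('Chunk ' + i + ' is not a base64 data URL');
        }
        joined += (i === 0) ? urls[i] : urls[i].slice(idx + marker.length);
    }
    return joined;
}

// Usage with two stand-in chunks, each encoding a whole number of 3-byte groups:
var joinedUrl = joinDataURLs([
    'data:text/plain;base64,TWFuTWFu',
    'data:text/plain;base64,TWFu'
]);
console.log(joinedUrl); // "data:text/plain;base64,TWFuTWFuTWFu"
```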
