In a hurry? You can now use our HUp jQuery plugin to read files in a chunked fashion as data URLs. Hooray!
Got a minute or two? Let’s talk about file read chunking, data URLs and base64.
If you’ve been looking forward to the previously promised discussion about reading/downloading files to, and uploading files from, IndexedDB – well, keep looking forward, it’s on the way. In the meantime, however, let’s take a quick look at a problem that emerged out of making file reading chunkable for the HUp plugin, and its quick and easy solution.
The Problem
The HUp plugin has, as one of its goals, sensible defaults – such that a user can just call it against whatever element they want to make into a file reader/uploader drag-and-drop point, and presto! Everything works, and in a reasonable manner.
When I added the ability to read files in chunks, to mirror and complement the ability to upload files in chunks, we hit a snag with regard to the above goal. Generally speaking, developers using the plugin to read files in will probably want to use said files in the browser in some fashion, and one of the simplest representations that can be used directly in the browser is a base64-encoded data URL: you can pop it into the src attribute of a number of elements in a modern browser and do something with it right away. So, defaulting our read_method to readAsDataURL makes sense.
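For a single small file, that default boils down to something like the following sketch – plain FileReader rather than the plugin’s internals, with file standing in for a File object from a drop event or file input:

var reader = new FileReader();
reader.onload = function (e) {
    // e.target.result is a string like "data:image/png;base64,iVBOR..."
    document.querySelector('img').src = e.target.result;
};
reader.readAsDataURL(file);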
However! Now that we’re chunking file reads by default (since it offers benefits, some of which were discussed last time), if the file to be read is larger than our chunk_size (which defaults to 1MiB), then it will be read as a number of chunks, which will need to be reassembled before they can be used in a src attribute as a data URL. A small bit of extra work for the developer, but a little string manipulation and you’re done, right?
Not quite.
It’s common knowledge that encoding binary data as base64 results in an overall size increase of about 33%. This is because base64, with its total of 64 different values, can encode only 6 bits per character – that is, each eight-bit character in the encoded output (in, for example, UTF-8) represents just 6 bits of the source binary, so every 6 bits of input cost 8 bits of output.
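To see those numbers concretely, try a quick check in the browser console (btoa base64-encodes a binary string):

btoa('abc');        // "YWJj" – 3 bytes in, 4 characters out
btoa('abc').length; // 4 – every 3 bytes of input cost 4 bytes of output,
                    // hence the roughly 33% growth (4/3 ≈ 1.33)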
So what’s the problem? We can end up with chunks that can’t be trivially recombined, and the bits-per-character detail we just discussed explains why. Base64 encodes each group of 3 source bytes as 4 characters, padding the final group with ‘=’ when the input isn’t a multiple of 3 bytes long – so if we’re not careful about where we slice the file chunks, every chunk gets its own padding, and each base64 chunk after the first ends up out of alignment with the binary source. Attempting to simply concatenate them and use them will fail.
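A quick console demonstration of the misalignment – the split at byte 4 is arbitrary, and any offset that isn’t a multiple of 3 will do:

var whole = btoa('hello world');  // "aGVsbG8gd29ybGQ="
var part1 = btoa('hell');         // "aGVsbA==" – note the '=' padding
var part2 = btoa('o world');      // "byB3b3JsZA=="
part1 + part2 === whole;          // false – worse, the padding mid-string
                                  // makes the concatenation invalid base64

// Split on a multiple of 3 instead, and everything lines up:
btoa('hello ') + btoa('world') === whole; // true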
The Workaround
Recently, a user of the plugin brought up an unrelated problem regarding the mime-type on the produced data URLs. In the process of fixing it, I started thinking about how chunking might affect the data URLs, and a few quick experiments confirmed my suspicion – chunking the files and encoding them as base64 caused problems. I didn’t have the time to fix the issue at that moment, however, so I mentioned a workaround to said user.
You, lucky reader, don’t need this workaround, since the issue is now addressed – however, it might be of future interest, particularly if you need or prefer to work with ArrayBuffers.
Since using an ArrayBuffer means we don’t need to worry about base64 encoding a file, we can slice and dice each chunk however we please, and concatenating the resulting array buffers poses no issue. So, instead of using readAsDataURL as our read_method, I suggested this user employ the readAsArrayBuffer method instead.
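A rough sketch of that workaround, using a plain FileReader rather than the plugin’s own machinery (chunkSize and onComplete are illustrative names here, not plugin options):

function readInChunks(file, chunkSize, onComplete) {
    var buffers = [],
        offset = 0,
        reader = new FileReader();
    reader.onload = function (e) {
        buffers.push(e.target.result); // one ArrayBuffer per chunk
        offset += chunkSize;
        if (offset < file.size) {
            reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
        } else {
            onComplete(buffers); // every chunk, in order
        }
    };
    reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
}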
What if he wants to be able to display or otherwise use the read-in file in the browser? Well, we could convert the concatenated array buffers into base64, but a much more straightforward method is to create a new Blob from the concatenated array buffers, and get an object URL from said Blob to display.
This would look something like the following, assuming for the sake of example that you have three array buffers, which we’ll creatively name arr1, arr2 and arr3:
var blob = new Blob([arr1, arr2, arr3], {type: 'mime/type'}), // substitute the file's actual mime-type
    url = URL.createObjectURL(blob);
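One caveat worth noting: an object URL keeps its underlying Blob alive until the document unloads, so release it once you’re done with it:

var img = document.querySelector('img');
img.onload = function () {
    URL.revokeObjectURL(url); // free the Blob once the image has loaded
};
img.src = url;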
Pretty easy, right?
The Solution
I wasn’t entirely happy with having to suggest this workaround, however, so when I had an hour or so free to revisit the question, I tried a little experiment. Could it really be as simple, I wondered, as making sure our chunks were aligned to the nearest multiple of 6 (remembering, as mentioned above, that we get 6 bits per character)?
Yes, yes it could be.
So, the defaults for HUp can now remain blissfully unchanged, and chunking works just fine – when readAsDataURL is specified as the read_method, the plugin transparently adjusts the size of each chunk to the nearest multiple of 6 bytes until we reach the end of the file, ensuring that the resulting base64 remains aligned with the binary source and allowing trivial recombination of the data URLs.
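As a sketch of the idea – the rounding below is illustrative, and any multiple of 3 bytes would keep the encoder from padding mid-file, since base64 maps every 3 bytes to 4 characters; combineDataURLs is a hypothetical helper, not the plugin’s own:

var chunkSize = 1048576;    // the 1MiB default...
chunkSize -= chunkSize % 6; // ...trimmed down to a multiple of 6 bytes

// With aligned chunks, recombination is simple string work: keep the
// "data:mime/type;base64," prefix of the first URL, then append the
// bare base64 payload of each subsequent one.
function combineDataURLs(urls) {
    return urls[0] + urls.slice(1).map(function (u) {
        return u.slice(u.indexOf(',') + 1);
    }).join('');
}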
I’ve also added a small convenience function to handle said recombination – check out the GitHub repo for the details.