Pseudorandom number generators (PRNGs) are often treated like a compromise: their output isn’t as good as that of true random number generators, but they’re cheap and easy to use on computer hardware. But a special feature of PRNGs is that they’re reproducible sources of random-looking data.
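For instance, seeding Python’s stock PRNG twice with the same value (a minimal illustration of my own, not the original post’s example):

```python
import random

# Two generators seeded identically produce identical "random" streams,
# today, tomorrow, and on any machine.
a = random.Random(42)
b = random.Random(42)
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]
```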
Explicitly close files and sockets when done with them. Leaving files, sockets or other file-like objects open
unnecessarily has many downsides […]
The article’s main complaint is that “this advice is applying a notably higher standard of premature optimization to
file descriptors than to any other kind of resource”. It complains that it’s “depressingly commonplace” to see code
like this:
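(The snippet isn’t reproduced in this excerpt, but from context it’s the explicit-cleanup pattern, something like this:)

```python
with open("README.md") as f:
    contents = f.read()
```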
Sure, if it’s a one-off read of a doc file, you can almost certainly get away with just open("README.md").read(), but I honestly have no idea what’s depressing
about code that just works reliably.
Leaving files and sockets open is something that you can usually get away with, until weird stuff happens. And, no,
it’s not just about running out of file descriptors (although that does happen, too). I’ve previously written about
servers that mysteriously ran out of disk space because (spoiler) they were
deleting files that hadn’t been closed properly. Regardless of how awesome your computer and network equipment are, the number of TCP connections you can make to a given host and port is limited by a 16-bit integer (and in practice it’s usually limited further by your network settings); exceed that limit and you get network failures. Writing to files that you don’t explicitly close (or flush) is especially dicey: the write might appear to happen straight away, then on another day or in another environment it might get delayed.
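That last failure mode is easy to sketch in Python (the file name is made up):

```python
f = open("audit.log", "w")
f.write("important record\n")
# With no close() or flush(), the data can sit in a userspace buffer
# indefinitely. Whether it reaches disk "straight away" depends on the
# buffer size and on when (and how) the process exits.

# The boring, reliable version: the buffer is flushed and the file
# closed when the block exits, even if an exception is raised.
with open("audit.log", "w") as f:
    f.write("important record\n")
```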
Sure, these failure modes aren’t very common, but they definitely happen. I can think of three examples right now
from the past few years of my day job. Closing files as a habit is easier than figuring out when it’ll go wrong.
The article has an example of a function that lazily loads JSON given a file path string. Sure, it doesn’t work
properly if you close the file inside the function, but I’d say the problem is in the API design: the file handle
resource straddles the interface.
The first good alternative is to keep the file handle completely inside: take a path argument, load the data
eagerly, close the file and return the parsed data. The other good alternative is to keep the file handle completely
outside: take a file handle as argument and return it wrapped in a lazy JSON parser. Either way makes it easier to see
how and where the file should be closed.
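A minimal sketch of both shapes (the function and class names are mine, not the article’s):

```python
import json
from typing import Any, TextIO

# Alternative 1: the file handle lives entirely inside the function.
# Take a path, load eagerly, and the file is closed before returning.
def load_json(path: str) -> Any:
    with open(path) as f:
        return json.load(f)

# Alternative 2: the file handle lives entirely outside the function.
# The caller owns the handle, so the caller decides when to close it
# (any time after the first access to .data).
class LazyJSON:
    def __init__(self, f: TextIO):
        self._f = f
        self._loaded = False
        self._data: Any = None

    @property
    def data(self) -> Any:
        if not self._loaded:
            self._data = json.load(self._f)
            self._loaded = True
        return self._data
```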
The Google advice is pretty solid: when you’re done with a file or socket, just close it. I’ll add: sure, maybe it
won’t always be obvious when you’re done with a handle, but perhaps that code design is making life more “exciting”
than necessary. Production failures are more depressing than file closing code ever will be.
A few years ago I worked on the version 2 of some big enterprise’s internal website. A smaller company had the
contract, and I’d been subcontracted to deal with deployment and any serverside/backend changes.
The enterprise side had a committee to figure out lists of requirements. Committees are famously bad at coming up
with simple and clear specs, and prone to bikeshedding. Thankfully, the company I was contracting with had a project
manager who had the job of engaging with the committee for hours each day so that the rest of us didn’t have to.
However, we still got a constant stream of inane change requests. (One particular feature of the site changed name
three times in about two months.)
It was pretty obvious early on what was happening, so I integrated the existing website backend with a content
management system (CMS) that had an admin panel with a friendly WYSIWYG editor. New features got implemented as plugins
to the CMS, and old features got migrated as needed. We couldn’t make everything customisable, but eventually we
managed to push back on several change requests by saying, “You can customise that whenever you want through the admin
panel.”
So, we got things done to satisfaction and delivered, but there was one complication: who would actually use the admin panel and WYSIWYG editor. The committee members wouldn’t use it because they were ideas people and didn’t implement anything. The
company had IT staff who managed things like websites, but they were hired as technical staff, not for editing website
content. On the other hand, they had staff hired for writing copy, but they weren’t hired as website
administrators.
So here’s how they ended up using the CMS: CMS data would get rendered as HTML by the website backend, which would
then be exported to PDF documents by IT staff. The PDF documents would be converted to Word documents and sent to the
writers via email. The writers would edit the documents and send them back to the IT staff, who would do a side-by-side
comparison with the originals and then manually enter the changes through the graphical editor in the admin panel. All
of the stakeholders were delighted to have a shiny version 2 of the website that had a bunch of new features, was
highly customisable, integrated well with their existing processes and was all within budget.
Nowadays, when I’m designing something and I think it’s obvious how it will be used, I remind myself about that CMS
and its user-friendly, graphical editor.
The Platonic ideal of OOP is a sea of decoupled objects that send stateless messages to one another. No one really
makes software like that, and Brian points out that it doesn’t even make sense: objects need to know which other
objects to send messages to, and that means they need to hold references to one another. Most of the video is about the
pain that happens trying to couple objects for control flow, while pretending that they’re decoupled by design.
Overall his ideas resonate with my own experiences of OOP: objects can be okay, but I’ve just never been satisfied
with object-orientation for modelling a program’s control flow, and trying to make code “properly”
object-oriented always seems to create layers of unnecessary complexity.
There’s one thing I don’t think he explains fully. He says outright that “encapsulation does not work”, but follows
it with the footnote “at fine-grained levels of code”, and goes on to acknowledge that objects can sometimes work, and
that encapsulation can be okay at the level of, say, a library or file. But he doesn’t explain exactly why it sometimes
works and sometimes doesn’t, and how/where to draw the line. Some people might say that makes his “OOP is bad” claim
flawed, but I think his point stands, and that the line can be drawn between essential state and accidental state.
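For instance, here’s the kind of fine-grained encapsulation that does work under that reading (a sketch of my own, not from the video): the hidden state is purely accidental, so no caller ever needs to reach in and coordinate with it.

```python
import functools

# The memo table is accidental state: it exists only for speed, and the
# function's observable behaviour is identical with or without it, so
# encapsulating it costs nothing.
@functools.lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Essential state, the data the program is actually about, is different: other parts of the program genuinely need it, so hiding it behind objects just forces awkward interfaces on top of it.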
I’m a huge proponent of designing your code around the data, rather than the other way around, and I think it’s one
of the reasons git has been fairly successful… I will, in fact, claim that the difference between a bad programmer and
a good one is whether he considers his code or his data structures more important. Bad programmers worry about the
code. Good programmers worry about data structures and their relationships.
Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost
always be self-evident. Data structures, not algorithms, are central to programming.
Beyond craftsmanship lies invention, and it is here that lean, spare, fast programs are born. Almost always these are
the result of strategic breakthrough rather than tactical cleverness. Sometimes the strategic breakthrough will be a
new algorithm, such as the Cooley-Tukey Fast Fourier Transform or the substitution of an n log n sort for an n² set of comparisons.
Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is
where the heart of your program lies. Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.
So, smart people have been saying this again and again for nearly half a century: focus on the data first. But
sometimes it feels like the most famous piece of smart programming advice that everyone forgets.
Back in 2001 Joel Spolsky wrote his classic essay “Good Software Takes Ten
Years. Get Used To it”. Nothing much has changed since then: software is still taking around a decade of
development to get good, and the industry is still getting used to that fact. Unfortunately, the industry has investors
who want to see hockey stick growth rates on software that’s a year old or less. The result is an antipattern I like to
call “Hello World Marketing”. Once you start to notice it, you see it everywhere, and it’s a huge red flag when
choosing software tools.
Here’s a little story of some mysterious server failures I had to debug a year ago. The servers would run okay for a
while, then eventually start crashing. After that, trying to run practically anything on the machines failed with “No
space left on device” errors, but the filesystem only reported a few gigabytes of files on the ~20GB disks.
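The spoiler from earlier applies, and it’s easy to reproduce on any Unix-like system (a Python sketch with a made-up file name):

```python
import os

f = open("big.log", "w")
f.write("x" * (100 * 1024 * 1024))  # stand-in for gigabytes of logs
f.flush()

os.remove("big.log")  # the directory entry disappears...
# ...but the kernel keeps the inode and its blocks allocated for as long
# as any process holds the file open. `du` no longer sees the file,
# while `df` still counts the space as used: "missing" disk space.
f.close()  # only now is the space actually freed
```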
Some months ago I showed how inheritance and polymorphism
work in compiled languages by reimplementing them with basic structs and function pointers. I wrote that code in D,
but it could be translated directly to plain old C. In this post I’ll show how to take advantage of D’s features to
make DIY inheritance a bit more ergonomic to use.
Although I have used these tricks
in real code, I’m honestly just writing this because I think it’s neat what D can do, and because it helps explain
how high-level features of D can be implemented — using the language itself.
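This excerpt doesn’t include the code, but the basic trick, sketched here in Python rather than D, is to build vtables by hand:

```python
# Each "class" is a table of function pointers; each "object" is a
# struct (here just a dict) whose first field points at its class's table.
def animal_describe(self):
    return "some animal"

animal_vtable = {"describe": animal_describe}

# "Inheritance": copy the parent's table, then override some entries.
def dog_describe(self):
    return self["name"] + " the dog"

dog_vtable = dict(animal_vtable, describe=dog_describe)

# "Polymorphism": dispatch every call through the object's own table, so
# one call site runs different code depending on the object's class.
def describe(obj):
    return obj["vtable"]["describe"](obj)

rex = {"vtable": dog_vtable, "name": "Rex"}
print(describe(rex))  # Rex the dog
```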
I’ve promised to write a blog post about the DIY polymorphic classes implementation in Xanthe, the experimental game I once wrote for bare-metal x86. But first, I decided to write a precursor post that explains how polymorphism and inheritance work in the first place.
A year ago I worked on a web service that had Postgres and Elasticsearch as backends. Postgres was doing most of the
work and was the primary source of truth about all data, but some documents were replicated in Elasticsearch for
querying. Elasticsearch was easy to get started with, but had an ongoing maintenance cost: it was one more moving part
to break down, it occasionally went out of sync with the main database, it was another thing for new developers to
install, and it added complexity to the deployment, as well as the integration tests. But most of the features of
Elasticsearch weren’t needed because the documents were semi-structured, and the search queries were heavily
keyword-based. Dropping Elasticsearch and just using Postgres turned out to work okay. No, I’m not talking about
brute-force string matching using LIKE expressions (as
implemented in certain popular CMSs); I’m talking about using the featureful text search indexes in good modern
databases. Text search with Postgres took more work to implement, and couldn’t do all the things Elasticsearch
could, but it was easier to deploy, and since then it’s been zero maintenance. Overall, it’s still considered a net win (I talked to some of the developers again just recently).
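For flavour, here’s roughly what that looks like (a hedged sketch with invented table and column names, not the service’s actual schema; the generated column needs Postgres 12+):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# One-time setup (normally a migration): a stored tsvector column plus
# a GIN index makes keyword search fast without an external engine.
cur.execute("""
    ALTER TABLE documents ADD COLUMN IF NOT EXISTS tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
""")
cur.execute("CREATE INDEX IF NOT EXISTS documents_tsv_idx"
            " ON documents USING GIN (tsv)")
conn.commit()

# Querying: plainto_tsquery turns raw user keywords into a tsquery, @@
# matches it against the index, and ts_rank orders results by relevance.
cur.execute("""
    SELECT id, title, ts_rank(tsv, q) AS rank
    FROM documents, plainto_tsquery('english', %s) AS q
    WHERE tsv @@ q
    ORDER BY rank DESC
    LIMIT 10
""", ("user search keywords",))
rows = cur.fetchall()
```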