Please Remember My Dear Aunt Sally: June 2011

Lately I've been studying databases and SQL. Not long ago I purchased a large corpus of financial data (historical stock market indices), which came with a front-end application to output the numbers in various ways for easy importation into other software. What’s going on behind the scenes, most likely, is that a relational database is being queried and various views are being produced as flat tables—there's nothing surprising there. But using this software has made me want to know more about database theory, and to understand SQL better, since the latter is basically the lingua franca of databases in the corporate world and everywhere else.

Hence the question in the title of this blog entry. RDMSs are a fact of life—almost universally agreed upon as the best way to store large data collections—but the way we use them has changed: with increased computer power we are now able to query a database in far more complex ways, and far more frequently; this is what makes data mining possible. And general-purpose programming languages have changed equally radically since SQL was developed. In short: if SQL was developed to support a certain kind of database access, a kind that was not easy to do with the programming languages of the time, and a kind that is not much like the kind of access we need now, has its time passed? And if the answer is “yes,” what comes next?

One fact that supports a “yes” vote: First, not only have programming languages changed, but the environments in which we use them have, as well. The designers of SQL didn’t anticipate point-and-click interfaces, let alone the WWW; they probably expected front-end interfaces of some kind, but were thinking only of the green-screen mainframe fill-and-submit forms of the 1970s. The typical web application is a mess. It’s a Frankenstein’s monster of barely adequate tools assembled into an effective but painfully ugly whole: peek behind the curtain of the Great and Powerful Oz of your typical e-commerce site or a Web-based ERP system sitting on a corporate intranet, and you’ll see dynamic HTML full of JavaScript, presenting to the user some buttons that execute PHP code, which queries an RDMS using SQL code, which was written inline in the PHP and is then handed off to the server as a string variable. It works, but, again, it’s a crude, ugly mess. This is the reason behind the push towards more integrated development approaches for Web 2.0 applications—I’m thinking of Ruby on Rails as a particularly fine example, though as I mentioned in the previous post here, I’m leaning towards Catalyst or perhaps a Python-based solution for my own work.

Another point: the languages in which better database front-ends can and should be written lend themselves naturally to direct access to the database. I’ll anticipate objections to the point in the last paragraph by saying that SQL has changed over the past 40 years. Stored procedures are a major difference, and one that can, in part, avoid the clunkiness of having to pass every query to the database as a long sequence of string literals. Extensions like Transact-SQL make stored procedures able to process data in a much smarter way, reducing the number of individual queries the front end has to do. One result is a greatly reduced load on the network between the server and the client, which I'm not saying isn't important! But, in terms of abstraction, stored procedures, I argue, are the wrong direction. (T-SQL in particular falls far short of being a modern programming language: it looks like nothing so much as Fortran77. Wikipedia points out—damning with faint praise!—that it’s Turing-complete. A ball of yarn and a Sharpie are also Turing-complete, but that doesn't mean I'm going to write code to access a database with them.) A far better design is to build the functionality of SQL directly into the front-end language. This is a simple but incredibly powerful idea. Modern programming languages have the capability (i.e., which older languages did not) to effortlessly create and manipulate data structures of essentially unlimited complexity. There’s no reason why the programmer couldn’t specify queries and retrieve data via a library that translates native-language idioms into SQL transparently; going a step further, I don’t see why a language library shouldn’t allow access to the RDMS server directly, exposing the results of a query as a data structure native to the language, without SQL in the middle at all. Neither of these is pie-in-the-sky stuff: for implementations of the former idea, see Rails’ Active Record Query interface, Python’s SQLAlchemy, or Django. Nobody seems to have actually implemented the complete elimination of SQL yet, but there’s one brief but thought-provoking discussion here. And Microsoft's LINQ goes a long way towards breaking down the distinction between different kinds of data storage, providing a unified interface; as do the various wizards in Visual Studio 2010 which allow you to just drag and drop tables from Access into your project.

Part of the reason for the continued use of explicit SQL code constructed by a front-end app and submitted as strings, I suspect (again, sidestepping the speed issue) has to do with security. Stored procedures provide security via encapsulation; explicit code can be more easily checked for attempted exploits such as SQL injection. But the latter is an issue precisely because of the exposure of the database structure at an unnecessarily high level; and the former is what object-oriented languages are all about. (Again, a design feature that SQL needed to provide at the time of its conception but which has now been implemented with far more versatility and elegance in programming languages: OOP and other techniques preach the gospel of “maximize cohesion, minimize coupling” to improve not just design and maintainability of software but also data security.)

I want to make clear that I’m not advocating the elimination of relational databases as such. (Even though arguably most current implementations don’t really follow relational database theory in practice.) What I don’t like is the crude way that we interact with them. I want to see programmer-space access that goes beyond building SQL code with string manipulation operators, then handing it off to the database; I want to see end-user-space access that goes beyond clunky wizards that build a limited subset of SQL queries (always almost, but not quite, the one you really need). The relational model has been adopted almost universally as the standard way to store data sets of arbitrary size, for very good reasons that held true thirty years ago and still hold true today. But let’s make sure our reasons for the way we interact with the relational model still hold true as well.

Looking back at my earlier posts here I see a lot of optimism about rebooting (an irresistible pun) my ability to write software; a year or so ago, I had a plan to learn some new languages, notably Java, and refresh my knowledge of existing ones, notably C++, by embarking on a few different programming projects. It seems I’ve taken a slightly different route. I’ve done a lot of programming over the past year, but it’s almost all been in Perl. Hardcore Perl fans would say this was inevitable—that the language is so powerful and flexible that, once you start using it, there’s no need to use any other. People with a lower opinion of Perl would be more likely to call it a sickness.

What’s so seductive about Perl? A few things, to start, off the top of my head:

The astonishing variety of freely available modules (Perl-speak for libraries) for highly specialized tasks: more than a few times I’ve girt my loins for a difficult bit of coding, only to check CPAN and find that someone else has already done it for me.

Flexible built-in data structures and the ability to easily create more elaborate ones when the built-ins won’t do. Perl’s built-in lists, hashes, and references seem to capture the best features of the C++ Standard Template Library, Lisp lists, and hand-built C data structures, with performance (thanks to a highly evolved and optimized interpreter) fast enough for most purposes.

Syntax that, once a slight learning curve is overcome, feels natural and is (surprise!) easy to read. Perl’s reputation for looking like line noise comes from two things: the idiomatic use of regular expressions as the most effective way to parse string (and often numeric as well) data, and the way that non-alphanumeric characters are used to mark variable types. The former did encourage “write-only” code in Perl 4, but in the current implementation of the language, you can interpolate comments in regexps to make them as readable as you want. The latter becomes more rational once you realize that the dollar, percent, and at symbols are used to identify a variable’s type not just at declaration, but at use, in context; they aren’t just an uglier cousin of Hungarian notation but a concise expression of an extremely elegant kind of dynamic casting. There are a few rough edges—for example, the backslash and curly braces are overloaded for references (sort of like C pointers) in a way that makes me uncomfortable—but on the other hand, there are only so many non-alphanumeric characters in ASCII. Perhaps Perl 6 should move on to use some interesting Unicode symbols, or (I’m only mostly joking here) something like APL… (In the meantime, the plethora of dollar signs in Perl code gives me a bit of a warm fuzzy feeling, as it reminds me of my childhood programming in 8-bit home computer BASIC, where the dollar sign indicated a string variable.)

I still fully intend to learn Python. Even though he was contrasting immature versions of both languages, open-source guru Eric Scott Raymond’s classic essay on the elegance and ease of Python compared to Perl still makes as big an impression on me now as it did when I read it ten years ago. Perl’s become more elegant since then, and its suitability for large software projects has been greatly enhanced by the Moose project, which transforms Perl’s OOP features to such a degree that it almost creates a separate dialect of the language. On the other hand, Python has loud cheerleaders in the scientific community: my impression is that the NumPy libraries have in just a few years supplanted Fortran as the development environment of choice for numerical analysis of large data sets, something that is a critical programming task for me. And somewhere near the top of my to-do list is to learn one of the new platform-independent Web 2.0 programming frameworks; Ruby on Rails seems to be the most popular, although I privately suspect that’s partly because of its oddly catchy name—but Python-based systems are close behind.

Please Remember My Dear Aunt Sally

Monday, June 20, 2011

Is SQL obsolete?

Sunday, June 19, 2011

Perl Ate My Brain!

Archive

Links

Search

Labels

About Me