Monday, June 20, 2011

Is SQL obsolete?

Lately I've been studying databases and SQL. Not long ago I purchased a large corpus of financial data (historical stock market indices), which came with a front-end application to output the numbers in various ways for easy importation into other software. What’s going on behind the scenes, most likely, is that a relational database is being queried and various views are being produced as flat tables—there's nothing surprising there. But using this software has made me want to know more about database theory, and to understand SQL better, since the latter is basically the lingua franca of databases in the corporate world and everywhere else.

Hence the question in the title of this blog entry. RDMSs are a fact of life—almost universally agreed upon as the best way to store large data collections—but the way we use them has changed: with increased computer power we are now able to query a database in far more complex ways, and far more frequently; this is what makes data mining possible. And general-purpose programming languages have changed equally radically since SQL was developed. In short: if SQL was developed to support a certain kind of database access, a kind that was not easy to do with the programming languages of the time, and a kind that is not much like the kind of access we need now, has its time passed? And if the answer is “yes,” what comes next?

One fact that supports a “yes” vote: First, not only have programming languages changed, but the environments in which we use them have, as well. The designers of SQL didn’t anticipate point-and-click interfaces, let alone the WWW; they probably expected front-end interfaces of some kind, but were thinking only of the green-screen mainframe fill-and-submit forms of the 1970s. The typical web application is a mess. It’s a Frankenstein’s monster of barely adequate tools assembled into an effective but painfully ugly whole: peek behind the curtain of the Great and Powerful Oz of your typical e-commerce site or a Web-based ERP system sitting on a corporate intranet, and you’ll see dynamic HTML full of JavaScript, presenting to the user some buttons that execute PHP code, which queries an RDMS using SQL code, which was written inline in the PHP and is then handed off to the server as a string variable. It works, but, again, it’s a crude, ugly mess. This is the reason behind the push towards more integrated development approaches for Web 2.0 applications—I’m thinking of Ruby on Rails as a particularly fine example, though as I mentioned in the previous post here, I’m leaning towards Catalyst or perhaps a Python-based solution for my own work.

Another point: the languages in which better database front-ends can and should be written lend themselves naturally to direct access to the database. I’ll anticipate objections to the point in the last paragraph by saying that SQL has changed over the past 40 years. Stored procedures are a major difference, and one that can, in part, avoid the clunkiness of having to pass every query to the database as a long sequence of string literals. Extensions like Transact-SQL make stored procedures able to process data in a much smarter way, reducing the number of individual queries the front end has to do. One result is a greatly reduced load on the network between the server and the client, which I'm not saying isn't important! But, in terms of abstraction, stored procedures, I argue, are the wrong direction. (T-SQL in particular falls far short of being a modern programming language: it looks like nothing so much as Fortran77. Wikipedia points out—damning with faint praise!—that it’s Turing-complete. A ball of yarn and a Sharpie are also Turing-complete, but that doesn't mean I'm going to write code to access a database with them.) A far better design is to build the functionality of SQL directly into the front-end language. This is a simple but incredibly powerful idea. Modern programming languages have the capability (i.e., which older languages did not) to effortlessly create and manipulate data structures of essentially unlimited complexity. There’s no reason why the programmer couldn’t specify queries and retrieve data via a library that translates native-language idioms into SQL transparently; going a step further, I don’t see why a language library shouldn’t allow access to the RDMS server directly, exposing the results of a query as a data structure native to the language, without SQL in the middle at all. Neither of these is pie-in-the-sky stuff: for implementations of the former idea, see Rails’ Active Record Query interface, Python’s SQLAlchemy, or Django. Nobody seems to have actually implemented the complete elimination of SQL yet, but there’s one brief but thought-provoking discussion here. And Microsoft's LINQ goes a long way towards breaking down the distinction between different kinds of data storage, providing a unified interface; as do the various wizards in Visual Studio 2010 which allow you to just drag and drop tables from Access into your project.

Part of the reason for the continued use of explicit SQL code constructed by a front-end app and submitted as strings, I suspect (again, sidestepping the speed issue) has to do with security. Stored procedures provide security via encapsulation; explicit code can be more easily checked for attempted exploits such as SQL injection. But the latter is an issue precisely because of the exposure of the database structure at an unnecessarily high level; and the former is what object-oriented languages are all about. (Again, a design feature that SQL needed to provide at the time of its conception but which has now been implemented with far more versatility and elegance in programming languages: OOP and other techniques preach the gospel of “maximize cohesion, minimize coupling” to improve not just design and maintainability of software but also data security.)

I want to make clear that I’m not advocating the elimination of relational databases as such. (Even though arguably most current implementations don’t really follow relational database theory in practice.) What I don’t like is the crude way that we interact with them. I want to see programmer-space access that goes beyond building SQL code with string manipulation operators, then handing it off to the database; I want to see end-user-space access that goes beyond clunky wizards that build a limited subset of SQL queries (always almost, but not quite, the one you really need). The relational model has been adopted almost universally as the standard way to store data sets of arbitrary size, for very good reasons that held true thirty years ago and still hold true today. But let’s make sure our reasons for the way we interact with the relational model still hold true as well.

No comments:

Post a Comment