Wednesday, September 1, 2010

Moving On

As you can see, the flow of blog posts here stopped short quite a while ago. Not too early for some interesting content to appear, I hope, but the intended story arc of insights relating to database systems was never fulfilled.

I am going to round off this blog now (this will probably be the last post) without fulfilling any of my old “to be covered in future posts” promises. Sorry. Things have moved on, and I have moved on. Instead, I am going to finish off with taking a step back and making some reflections about the role of database systems in people’s minds and in the industry.

But first, I just want to explain a few things about this blog, the company that I was working for while writing the previous posts, and which I am no longer working for (Apptus Technologies), and what I am doing now and in the future.

Why This Is the End

The reason for the decline of this blog was changes at Apptus Technologies that shifted the focus for me and my colleagues a little bit. The company took on a new direction after the international financial crisis of late 2008. We decided to narrow down our scope and build something that would lie as close as possible to the final use our customers typically found for our software, rather than producing a generic platform. Hence, spreading the word of our insights in generic data management, which was the primary mission of this blog, became less of an issue.

The new direction, with the Apptus Esales platform, became quite a success, both in terms of creating a technologically strong product and in terms of selling it. Apptus is again thriving, and hiring new software developers.

From now on, however, future development of Apptus will happen mostly without my involvement. After about ten years in the software development business, most of that time as Apptus' head of research, I have decided to give academic research and teaching another shot. In early August, I started as an assistant professor at the IT University of Copenhagen. I will certainly make use of what I have learned at Apptus in my new position, and most likely my research will involve developing ideas that have come up at Apptus. Plausibly, my former Apptus colleagues will also be involved in some manner.

If you are interested in following what I do, or would like to contact me about research ideas for instance, there are several entry points for finding and contacting me on the Internet.

If you are interested in what happens at Apptus, keep track of the fairly recently reworked Apptus website. There is some blogging going on there now, and I left some texts behind that you might find there (uncredited) sometime.

Reconsidering Databases

The transition I have been going through in the last few months inspired me to make some general reflections on database systems in a wider perspective, which I thought I would share with you. I have asked myself: why do we have database systems, what do people really expect of them, why are people often so dissatisfied with them, and what directions are available for improvement?

I figure the things most database systems are intended for are the following:
  • Persistent storage
  • Secure storage
  • Organizing and structuring data
  • Complex query answering
  • Efficient (quick) query answering
  • Concurrent access
  • Concurrent update
Now, who needs all of that? Hardly anybody. That raises a few questions which I will formulate and give my answers to.

1 What are database systems mostly used for?

I haven't researched what people use database management systems for, but I think I have enough knowledge and experience to base some thoughts on my personal impression of what most people consider to be the main task of a DBMS.

I think it's persistent storage. Period. People who don't want data to disappear when they shut down the program (or the computer) set up a database system to hold the data. Anyhow, I think that's what most programmers (and I identify as one) consider the DBMS to be mostly about.

Not everyone uses something that deserves to be called a DBMS for this, but many do. Why, if they need only a single aspect of what the DBMS was designed for? Probably because it's easy and safe. A DBMS is something they know how to interact with, and they trust that it won't lose any data.

This is the reason for the object model for databases that is still out there, puzzling or even annoying advocates of the relational model. People who write object-oriented software want the objects to still be there when a program comes back up after being shut down. I might not want to call that a DBMS, but I understand their position. I am not even going to say that it is always a bad idea to base an object persistence library on a general DBMS, but it does seem a bit excessive. It should be possible to create a slimmer and more efficient system designed specifically for object persistence, and I am sure there are such systems out there too.
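
To make the point concrete, here is a minimal sketch of how little machinery plain object persistence needs, using Java's standard serialization. The class and file name are made up for the example; a real persistence library would add versioning, identity management, and indexing.

    import java.io.*;

    // Minimal object persistence sketch using plain Java serialization.
    // The class and file name are made up for the example; a real
    // persistence library would add versioning, identity, and indexing.
    class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final Person father; // object references are followed automatically

        Person(String name, Person father) {
            this.name = name;
            this.father = father;
        }
    }

    public class Persist {
        public static void main(String[] args) throws Exception {
            Person anakin = new Person("anakin", null);
            Person leia = new Person("leia", anakin);

            // Write the object graph to disk...
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new FileOutputStream("people.bin"))) {
                out.writeObject(leia);
            }
            // ...and the objects are "still there" after a restart.
            try (ObjectInputStream in = new ObjectInputStream(
                    new FileInputStream("people.bin"))) {
                Person restored = (Person) in.readObject();
                System.out.println(restored.father.name); // prints "anakin"
            }
        }
    }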

2 What is the main point where database systems don’t deliver?

Maybe I am skewed towards things that have mattered in my own work, but I would say that at least one major point of dissatisfaction with general database systems is their lack of computational efficiency in delivering query results. We at Apptus found, like many others have, that a standard DBMS is about a factor of 100 away from the query performance of a well-designed search engine, and that there is no reason in principle why it has to be.

This is the reason why regular DBMS were thrown out of all large-scale search engines on the Internet in the early 2000s. (Well, almost all. Amazon are still using Oracle, I think, but I would presume that the Amazon search engine has evolved into something that is practically hand-written by Oracle technicians by now.)

3 Do we really need a single system to do all that a DBMS does?

Since most people are apparently primarily concerned with one, or at least just a few, of the capabilities of a DBMS, can't we abandon the idea altogether and split it into a number of systems, or at least components, that are specialized for those few things?

My answer to that is that I think we still need the single-system DBMS. It is true that in many cases where it used to be routine to bring in the DBMS, people have started using other things. And rightly so. That development will probably continue. But when you get into concurrency and security, everything gets entangled. I don’t see how that can be lifted out to work by itself. A secure system with concurrency capability just has to have control of everything that is done with the data.

So I think there are still businesses where you need a DBMS approximately as we know it. What one could wish for, however, is that DBMS in the future may gain a greater elasticity to satisfy different needs. For instance, it should be easier, and more transparent, to sacrifice a bit of update concurrency for query speed, or to improve update capabilities by reducing the capability for complex data organization. Today's SQL systems, at the core still optimized for the needs and hardware resources of the 1980s, are not good at giving us that flexibility.

4 For which aspects does the relational model help?

I myself have frequently tried to sell the importance of a strict data model with a firm basis in logic, and been frustrated by the lack of understanding of the connection between logic and databases that dominates the industry, and actually much of academia too. (Although I am less convinced now than I was a few years ago that the relational model is the perfect solution.)

But looking at the points above, which of them does a good and logical data model really solve? Actually not many. It is essential only to two of them: organizing and querying complex data. Those that can do without that can do without thinking very much about data models. Some of the other things certainly get a little simpler if you have a good grasp of logic and data modelling, but they are not strictly necessary, and not top priority.

This, I think, is the reason why most programmers are so indifferent to logical data modelling, even though most of them should have the ability to both understand and appreciate the elegance of it.

That was all I had to say at this point. Not a finished theory to get oriented in the database world, but maybe a few insights that at least I personally am going to keep in mind in choosing the direction of my future research.

Tuesday, March 17, 2009

The Relational Model Is Not Logical

by Jesper Larsson

Look up the relational model on Wikipedia, and the first sentence that hits you (in the current version at least) is “The relational model for database management is a database model based on first-order predicate logic, first formulated and proposed in 1969 by Edgar Codd.” It would be nice if that were true. If the dominant model for databases had really taken its starting point in logic, the database field might not be such a mess today.

Although there is a clear connection between logic and the relational model, the assertion that the model is based on logic is at the very least debatable.

Relational model opening scroll. Art by ctail (some rights reserved).

Logic and Relations

The brilliant idea at the core of the relational model is to use the logical/mathematical concept of relations as the single concept to describe data. In other models, there are typically at least two concepts: entities (or objects or whatever) and relationships (or links, connections, etc.). But in the relational model, one concept is enough.

Let us think about a typical real-life example of something that can be modeled as a relation: fatherhood. Fatherhood is a relation between two domains of objects: the set of all possible fathers and the set of all people. If we name the set of fathers F, and use X to denote an element of F (i.e., some father); and similarly call the set of people P and an element of it Y, then we can express the relation something like this:

X father_of Y

Mathematicians and logicians usually prefer single-letter symbols, and mostly write in prefix notation, so let us shorten father_of to f and write:

f(X, Y).

We call f a relation symbol, following Hodges1.

There is a point to using single-letter symbols instead of descriptive names, apart from saving space. It stresses that the symbols, as far as their use in logic is concerned, are really just that: abstract symbols that follow formal logical rules that we can define in whatever way we see fit. The meaning of the symbols is really beyond the scope of the logic. Connections between logical statements and the outside world can only be made through interpretations – attaching real-world concepts to the symbols.

For instance, we can take the statement “X is the father of Y” as the meaning of relation symbol f with arguments X and Y. Such a statement, used to attach a meaning to a relation symbol, is called a predicate. (At least according to some authors. There are other uses of the word predicate, a common one being to use predicate more or less in place of relation symbol.)

A more direct way to correlate the relation symbol f with the real world, without relying on any meaning to be evaluated by human judgment, is to provide the exact values for which f evaluates to true. This set of values is a subset of the Cartesian product F × P of the two sets F and P, and an interpretation of f can be given in the form of such a subset of F × P – a set of ordered pairs of values for which f evaluates to true.

For instance, one interpretation could be:

{ (anakin, leia), (george, george_w), (frank, dweezil) }.

There you have an explanation for the common way to define the term relation in mathematics simply as a subset of a Cartesian product.
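
To make the definition concrete, here is a small sketch in Java (names taken from the example above) that represents the interpretation of f as an explicit set of pairs; evaluating f(X, Y) is then nothing more than a membership test.

    import java.util.*;

    // A relation as an explicit subset of a Cartesian product:
    // the interpretation of f is a set of ordered pairs, and
    // evaluating f(x, y) is a plain membership test.
    public class Fatherhood {
        public static void main(String[] args) {
            Set<List<String>> f = new HashSet<>();
            f.add(Arrays.asList("anakin", "leia"));
            f.add(Arrays.asList("george", "george_w"));
            f.add(Arrays.asList("frank", "dweezil"));

            System.out.println(f.contains(Arrays.asList("anakin", "leia")));    // true
            System.out.println(f.contains(Arrays.asList("anakin", "dweezil"))); // false
        }
    }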

A basic idea of the relational model is that sets such as these – relations, or extensions of the corresponding predicate – should be the only concept used for representing all data in the database, although as we shall see, it commonly is not.

The values anakin, leia, george, etc. are supposed to be objects from the “real world”, but when displaying the value like this we obviously have to use some sort of representative (in the form of character strings) rather than the actual objects (people). It is to break out of traps like this one that texts on logic sometimes contain peculiar assertions like “constants are interpreted as themselves.”

The fatherhood relation is a binary, or 2-ary, relation; and its elements are pairs, or 2-tuples. In general, a relation symbol can be n-ary for any natural number n. For instance, we can write

p(X, Y, Z)

and let p be interpreted by, for instance, the predicate “X is a person who was born in city Y in the year Z”, or by the relation

{ (george, milton, 1924),
(george_w, new_haven, 1946),
(frank, baltimore, 1940),
(dweezil, los_angeles, 1969) }.

This demonstrates the elegant uniform-concept idea of representing data as relations. In a model that has both entities and relationships, the person with birth city and year data would typically be perceived as an entity, while fatherhood would be a relationship between records of the person entity. In the relational model, both sets of data take the same basic form: the relation. A fatherhood relation between two person domains, and a person relation among the three domains person name, city, and year.

Where is the logic?
Photo by Peter Lindberg (some rights reserved).

Non-Logical Data

A convenient way of visualizing a relation, common in association with the relational database model, is to use a table:

parent_of
father (possible_father) child (person)
anakin leia
george george_w
frank dweezil

In the relational database model, each table has a heading (schema), which comprises a name and a data type for each column in the table (or, in an alternative terminology, for each attribute in the relation). These metadata are not perceived as actual data in the relational database model, even though they are something that the user of the database system is expected to provide.

In addition, relational database systems typically include constraints of one kind or another. These are also supplied in ad hoc, non-relational, ways.

So, the relational database model breaks its own principle, that all data is represented in the form of relations, by introducing several additional forms of data. Classical logic finds no need for these special forms.

Algebra vs Logic

In his original account of the relational model,2, 3 Codd did briefly suggest that a data language based on first-order predicate calculus should be developed, but, unfortunately, that did not make much of an impression on the database community. The paper also contained the embryo of another data language: relational algebra, a set of “operations on relations” that was later modified and expanded, and that inspired various database languages. The languages grew more and more ad hoc and complex, and more and more distant from the ideas of logic. During the 1970s, the database language SQL was developed, with at least some involvement from Codd, and today SQL in various forms is the king of illogical relational languages.

Obviously, many people have noted the complexity and overhead problems of the relational systems in use today. Unfortunately, many draw the incorrect conclusion that the problems lie in the relational model as such, rather than in implementations stuck in old ways, partly caused by inadequacies buried in the SQL language. Hence, they choose to revert to pre-relational tree-, network-, or record-oriented schemes.

Some however – particularly C. J. Date and Hugh Darwen, sanctioned by the whining but brilliant grump Fabian Pascal – have made heroic attempts to bring the relational model back to its roots. They have defined their own relational language dubbed “D”, which throws out the worst atrocities of SQL, and strengthens the declarative aspect.

They do not, however, go all the way to the elegance and simplicity of classical logic. (Although they have, as an aside in some of the Third Manifesto writings, taken the first step towards doing so with the alternative algebra “A”.) That is a pity, because expressing data in pure logical form makes for an excellent model. In logic, there is nothing special about concepts such as data types, constraints, comparisons, or arithmetic operations. They are just special cases of the basic simple constructs. Arguably, logic can provide the simplest model possible that allows defining and accessing databases, avoiding a lot of the awkwardness that comes from the directional operations of relational algebra.
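
One small illustration: a constraint such as “nobody has two fathers” needs no special constraint sublanguage; it is just another sentence over the relation symbol f introduced above:

∀X ∀Y ∀Z ((f(X, Z) ∧ f(Y, Z)) → X = Y).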

Others have realized this, of course. As early as the 1970s there was substantial research on the connection between logic and databases. Look back, and you will find pioneering work by, among others, Jack Minker and Raymond Reiter, and a lively research community. But for some reason, this was mostly not considered database research, but classified as artificial intelligence or logic programming, and database people appear to have taken little notice. The field more or less died out in the late 1990s, without having come through to the industry or database community.

Simplicity Enables Efficiency

Taking a logical approach to databases is not just an exercise in mathematical aesthetics. One of the most important points is that it simplifies the interface to the database management core, keeping the interface to higher levels of the system small. This takes a substantial workload off the shoulders of the system implementer.

In particular, a logical description of operations simplifies the formulation of the optimization tasks of the database system. Query optimization for SQL, with all its complexity, quirks, and inconsistencies, is a nightmare. Furthermore, for various reasons (which I hope to get back to in a future post), the actual operations of database applications using SQL systems are mostly not expressed in the language of the database at all. This takes the job of capable overall optimization from the level of practically impossible to theoretically impossible.

Optimization of database operations on the basis of pure logic, on the other hand, allows the system to concentrate on the actual computational complexity issues of translating logical formulae into execution plans. Not that this makes it easy, but arguably it makes it as simple as theoretically possible.

Revival?

The Claremont Report on Database Research, a document that has come out of a meeting of a number of distinguished database specialists in May 2008, mentions that “new declarative languages, often grounded in Datalog, have recently been developed for a variety of domain-specific systems.” Datalog is a Prolog-like language for database access, which dominated the database logic research during the 1980s, so this statement could be taken as an indication that a revival for logic in database access is on the horizon. I suppose our platform, Apptus Theca 2.0, can be said to be one such system, although our current interface is based on a graphical user interface rather than a language in text form.

I have not, however, come into direct contact with any such projects apart from our own, neither academic nor commercial. If you are involved in, or aware of, anything like this, I would love to hear from you.

References

1 Wilfrid Hodges, Classical Logic I: First-Order Logic, The Blackwell Guide to Philosophical Logic (Lou Goble, ed.), Blackwell Publishers, 2001.
2 Edgar F. Codd, Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks, IBM Research Report, 1969.
3 Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13 (1970), no 6, 377–387.

Friday, November 7, 2008

Fight Features!

by Jesper Larsson

I still fervently believe that the only way to make software secure, reliable, and fast is to make it small. Fight Features.
Andrew S. Tanenbaum

“Features”, “power”, and “richness” are generally perceived as good things in computer systems, not as problems. Still, most people who use a computer have some experience of their problematic side. Quoting Wikipedia: “Extra features go beyond the basic function of the product and so can result in baroque over-complication rather than simple, elegant design.”

Photo by tanakawho (some rights reserved).

I don't expect ever to be able to convince a marketing person to use “no added power” or “only the most rudimentary features” as a sales point. But I do sometimes get disappointed when I see the naive attitude that even experienced computing technology people take to new or exotic functionality. I am going to give you some examples from the field that all programmers have opinions about – programming languages – before finishing off with a little bit about the issue in other areas, such as database systems. Since most of the text is directed at programmers, it may be a little more technical than the previous posts on this blog.

Cool?

It was at a developers conference a few years ago that I had my major wake up call as to how uncommon my views seem to be among programmers. There were a number of presentations about new features in various languages and systems, and I was struck by how uncritically people received this. I particularly remember a reaction from the audience on a presentation of a new object language or framework of some kind, where apparently a core feature was that you were free to modify objects dynamically in any way, for instance by adding methods.

Audience member: Can you remove a method?
Presenter: Yes.
Audience member: Cool!

I was flabbergasted. What on earth would be the need to remove a method? To me it seemed like an extremely uncool feature!

The standard reaction to this is “So? If you don't like it you don't have to use it.” This always makes me want to jump up and down screaming in furious frustration. I am not sure if it is just because it is an incredibly stupid comment, or because it is something I would have been likely to say myself when I was a young and inexperienced programmer. I wish I could go back twenty years and tell myself off about it, but I wonder if I could have convinced myself – probably not, I was a pretty obstinate young man on these issues.

Anyway, if I did get a chance to persuade myself, there are four main arguments I would have used. All based on experience I did not have back then.

1. Blurring the focus of the system and its creator

This is rather obvious really, but it is easy to underestimate the extent to which limited time and resources keep you from doing everything you can think of. If the developers of a system have to write and test code for fifty features that someone might find useful, there is going to be considerably less time to spend on things that are crucial for almost everyone.

In the programming language context, the time spent to make sure that the extra feature code worked properly could have been spent thinking about code generation for some common language construct. For instance, optimizing to get tight loops to run really fast.

You may say that this is a problem for the developer that creates the system, not for you who are just using it. But that is not true: it is a problem for you if you cannot get tight loops to run fast. It becomes a problem for the system vendor only if it is such a big problem for you that you stop using the system.

2. Blurring the focus of the user

I think my first realization that fewer possibilities increased my efficiency came when I wrote a rather large automatic proof program in Lisp, without using any looping constructs – only recursion for repetition. This severely limited my ways of expressing what the program should do. The result: I no longer had to think about how to express it! I could concentrate fully on figuring out how to solve the real problems!
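
The discipline is easy to illustrate in any language. Here is a tiny Java sketch (not, of course, from the Lisp program) that repeats using recursion only:

    // Repetition by recursion only: summing an array without a loop construct.
    public class NoLoops {
        static int sum(int[] xs, int i) {
            if (i == xs.length) return 0;  // base case: past the end
            return xs[i] + sum(xs, i + 1); // recursive case
        }

        public static void main(String[] args) {
            System.out.println(sum(new int[] { 1, 2, 3, 4 }, 0)); // prints 10
        }
    }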

I had a similar feeling moving from C++ to Java. C++ is a mess of language constructs taken from at least three different programming paradigms. You can write C++ code in pretty much whichever way you like, and specify everything in detail. This makes me change the code back and forth, unable to decide. “This method should probably be declared virtual. No, wait, I can probably make it static – that will save a few machine instructions. No, sorry, it has to be virtual. But if I change this call I can make it static.”

Java is far from a perfect object-oriented language, but at least it is much stricter than C++. If you write Java without thinking in objects, your code tends to kick and squeal until you get it into a reasonably attractive structure.

Working on Ericsson's WAP gateway I was forced back into C++, but then a strict set of coding rules was imposed on me. The rules said things like “All classes which are used as base classes and which have virtual functions, must define a virtual destructor.” Again, this narrowed down the choices, and made it much easier to create reasonably well-structured code. Of course, it would have been nice if the compiler had assisted in enforcing those rules.

Photo by Ramchandran Maharajapuram (some rights reserved).

“Syntactic sugar” people call constructs that let them say the same thing slightly more concisely. Saving you from typing the same thing over and over is obviously nice, but what you gain in typing you may lose to the distraction of having to think about which syntax to use. Syntactic sugar may have a smooth taste, but too much of it erodes your programming teeth.

3. Others will use it

If you are a young student or hobbyist creating one little program after another for fun or examination, you may think that software development is just about writing code. But once you work in a reasonably large project, or come in contact with running code that needs to be maintained, you realize, first, that you are not in complete control of how the code is written and, second, that much of the work is reading code.

Hence, not using a construct when you write your own code does not save you from being bothered by it. In order to read what others have written, you need to understand what everything means. The more features a language contains, the more difficult it is to learn to read. The more different ways of doing the same thing it allows, the more different other people's code looks from yours – again making it difficult for you to decipher.

One way of warding off criticism against potentially harmful features is to say that they are no problem when used correctly, with discipline. True as that may be, not everyone who writes the code that you come in contact with has exactly the same discipline, values, and experience as you do. Losing the possibility of carefully using a feature yourself is often a small price to pay to prevent others from using it hazardously.

4. Optimization

I already mentioned that adding features may make optimization suffer because developers have less time, but there is one more issue concerning efficiency: the possibility of some features being used may keep the compiler from making assumptions that would allow it to generate considerably more efficient code.

An old example from the C language is the option to modify all sorts of variables through pointers. A neat trick in some situations perhaps, but it makes it difficult for the compiler to know whether it can keep values in CPU registers or not. Even if the program does not contain any such side effects, that may not be known at compile time. Hence, to make sure that the program runs correctly, values have to be repeatedly written down to main memory and read back, just because of the existence of a rarely used feature of the language.

Another example is when the assembly of program components is highly dynamic, such as with dynamic class loading in Java. As I mentioned in a previous post, my colleagues and I have been quite impressed with the optimization capabilities of Sun's HotSpot virtual machine. It has allowed us to produce high-performance systems completely written in Java. But there are some situations where optimization is less effective. Particularly, inlining does not happen as often as we would like it to. In part, this is due to the general difficulty of figuring out which method really gets invoked in languages that support polymorphism. But it is made worse by dynamic class loading, because the system can never be sure if execution is going to involve code that has not been seen yet.
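
To make that concrete, here is a sketch of the kind of call site involved, with made-up names. Roughly speaking, if Square is the only Shape implementation the virtual machine has loaded, HotSpot can speculate that the call always dispatches to Square.area() and inline it; but it must guard that assumption, because a class loaded later may invalidate it.

    // If Square is the only Shape implementation ever loaded, the virtual
    // machine can speculate that s.area() always means Square.area() and
    // inline it. Dynamic class loading forces it to guard that assumption,
    // since another Shape may appear later.
    interface Shape {
        double area();
    }

    class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    public class Shapes {
        static double total(Shape[] shapes) {
            double sum = 0;
            for (Shape s : shapes) {
                sum += s.area(); // the polymorphic call the optimizer wants to inline
            }
            return sum;
        }

        public static void main(String[] args) {
            Shape[] shapes = { new Square(2), new Square(3) };
            System.out.println(total(shapes)); // prints 13.0
        }
    }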

Polymorphic method invocation and dynamic class loading are wonderful features. I would not want to lose either of them. But they do have their drawbacks for efficiency. They make the optimizer's task more difficult, even in situations where they are not used.

Not Just About Programming

I am obviously not the first to have noticed the feature creep problem, and it is certainly not limited to programming languages. It is everywhere in computing. The opening quote from Tanenbaum is about operating systems, and the problem is huge in database systems.

For example, the more recent versions of the SQL standard are virtually impossible to implement in full. It is well known that this “richness of SQL”, as Surajit Chaudhuri diplomatically (or perhaps sarcastically) put it in his recent Claremont presentation, is a major obstacle to efficiency in database systems.

It is even an issue for data models themselves. But in that area, there is hope of some progress in the right direction – towards simplicity. More on that in future posts.

Thursday, October 30, 2008

Models and Efficiency, part 2: Databases

by Jesper Larsson

Databases is one of the few areas in computing where there is a noticeable awareness of models. At least, models are talked about. But actually quite few of the people who work with database systems have more than a very vague view of what data models are or what purpose they serve.

Photo by Marco Arment (some rights reserved).

I explained my view of what a model is in the previous post. A data model then, is a model for dealing with general-purpose data. Data as such is not that common in most computer users' experience these days. People usually have no need to think of the data that carries the information they work with, and systems are kind enough not to bother the users with the low-level concept of data. Only systems that deal with any kind of data, regardless of what it represents, need to include the notion of data in itself in the model they expose to the user. Particularly, data is a concern for database systems – which support intricate operations on general-purpose data. Hence the importance of data models in this context.

The Dominating Data Model

Out of the plethora of data models in use by database systems (some more vaguely defined than others), the most popular one, to the extent that we can skip all the others for the time being, is the relational model. Or perhaps I should say a subset of the family of relational models – for there are several. Even the instigator of the term, E.F. Codd, did not present a single coherent model, but seems to have invented a slightly different variant practically every time he presented it. For this post, however, it will do to talk about the relational model as if it were one.

An average database professional trying to explain the relational model typically starts by stating that in the relational model, data are stored in tables. Not a good start. First of all, tables do not really belong in the model, they are visualizations of relations – but that is a minor point.

More importantly, the model says nothing of storage! It acts as an interface, for direct users and for surrounding systems that interact with the database. Behind the scenes of the relational model, the system can deal with storage in whichever way it pleases – or whichever way the database administrator configures it. There are endless possibilities for data structures and access methods.

Photo by Phil Aaronson (some rights reserved).

The implementers of early database systems, such as System R, chose to use a simple one-to-one mapping between the model and the physical storage. It was a natural choice as a first attempt. What they were doing was trying things out, not creating the ultimate system.

But somehow, the idea of a simple connection between model and storage got stuck in people's minds, and in the major database systems on the market.

Logical–Physical Separation

Recall the inverted text example from a previous post. Here is a variant with the less interesting words removed, and with only the document numbers where the words appear:

Word Doc. no
art 1
art 3
war 1
war 2
modern 3
peace 2

Many people would say that this is an inefficient representation, because it contains multiple copies of the words that appear in more than one document. To make the representation more compact, they might try using a list of document numbers as the second column, perhaps specified as a “LOB” – a large object.

This is exactly what major database vendors do for their text indexes, which is one of the many choices they make to go against the basic idea of the relational model. Unnecessarily so.

Because, again, the model is not the storage. It is the user interface. The relational structure is not necessarily the same as that in the physical representation. The following figure illustrates how the logical model representation (at the top) can be tied to a compact physical disk storage (bottom).

This is just one possible representation, of course. Ideally, the physical representation should be chosen so that it makes critical operations efficient in the particular application where the database is to be used. The relational model still lets users access data in any way they please, but queries that the chosen implementation is not directly suited for may execute less efficiently.
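
As a concrete sketch of the same idea (with made-up names, and ignoring the actual disk layout), the physical side can store each word once together with a compact array of document numbers, while the logical interface still presents plain (word, document) rows:

    import java.util.*;

    // Sketch of logical-physical separation, with made-up names.
    // Physically, each word is stored once with a compact posting array;
    // logically, the user still sees plain (word, doc) rows.
    public class InvertedStore {
        private final Map<String, int[]> postings = new TreeMap<>();

        void put(String word, int... docs) { postings.put(word, docs); }

        // The logical relation: every (word, doc) pair, produced on demand.
        List<String> rows() {
            List<String> rows = new ArrayList<>();
            for (Map.Entry<String, int[]> e : postings.entrySet())
                for (int doc : e.getValue())
                    rows.add(e.getKey() + " " + doc);
            return rows;
        }

        public static void main(String[] args) {
            InvertedStore store = new InvertedStore();
            store.put("art", 1, 3);
            store.put("war", 1, 2);
            store.put("modern", 3);
            store.put("peace", 2);
            // Prints [art 1, art 3, modern 3, peace 2, war 1, war 2]
            System.out.println(store.rows());
        }
    }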

A problem with the basic design of major relational database systems is that they do not have enough information to make a good choice of representation. It is impossible for them to obtain the information about the application that they would need.

The Basis of Misunderstanding

People get the wrong idea about the role of the data model partly because they start using systems without learning the basic theory, but another reason, which I consider more critical, is how the user is forced to interact with the system in order to get reasonable performance. (By user in this case, I mean the person who uses the database system for implementing an application.)

In a perfect world, the way it is supposed to work is that the user provides logical definitions of the data, and the system somehow magically chooses the best implementation. This is a tough task, not least because the system must choose the implementation before it has had a chance of getting any feedback as to its use. In practice, this simply does not work.

Hence, the user, who does have an idea about which operations need to be efficient, needs to convey more information to the system than just the logical definitions. (Or move to a specialized database system for the application, if there is one.) And the only way that the user can interact with the system is through the relational interface!

This has two effects. First, database system vendors add physical aspects to the relational interface that are really not relational or logical at all. SQL is full of this – indexes for example. There are no indexes in the relational model; they belong in the implementation.

Second, the user is led into using implicit couplings between the relational definitions and the physical storage for improving performance. Denormalization is a blatant example. Relational theory tells us that the relational structure of the database should be normalized to avoid integrity problems. Yet, practitioners choose to do the opposite because it can give them a performance advantage.

The result of all this is that it is difficult to spot the actual relational model in the major relational database systems.

Throwing Out the Model

People have started to move away from relational database systems, mainly because of performance issues. A range of more specialized systems for narrower applications is emerging. I have no principle objection to that; using the same database system for everything is no end in itself.

However, I am convinced that there will still be a place for systems that are general-purpose enough to explicitly expose data to the user, without narrowing down what the data may represent, and accept complex specifications on relationships and retrieval operations. For this, a simple logical model that represents data as relations is extremely powerful.

Unfortunately, the relational model has become so closely associated with SQL that people tend to discard the model because they are not satisfied with SQL. The end of the relational era is proclaimed, when it should really be just the end of the SQL era.

I agree with the relational evangelists that the reason why SQL databases are so displeasing is not that they are relational. Rather, one of their shortcomings is that they are not really relational. However, I am not convinced that the Third Manifesto systems that the true relational lobby brings forward can take their place. The relational model itself does have some drawbacks, which I leave as a subject for future posts.

Friday, October 24, 2008

Practical Stuff: Models, Theories, Abstractions

by Jesper Larsson

They are usually mentioned only in a few fields, but models are actually everywhere in computing. In fact, the model is the key concept that lets computers do things that people find interesting.

The word model can mean many things, some of which are shown in the collage below. I am not talking about any of those. Nor am I talking about the technical use in logic (an assignment of constants to variables that makes a sentence true), despite the close tie between logic and computing – particularly databases.

Models. Photos by Nguyễn Thành Lam, Erik Hersman, Richard Smith, MaZzuk (some rights reserved).

No, I am talking about a model in the sense of an abstraction. A simplified representation of something. A framework for naming and envisioning non-physical objects to allow reasoning about them and shaping theories about them. Often by use of metaphors – analogous concepts picked from a well known domain.

The picture below is an example of a sort of abstract model that most computer users encounter. It shows a set of folders, or files, which in turn are the contents of another folder. But of course, they are not really folders. Folders are made of paper and kept in file cabinets. These are just images on the screen that symbolize something that we think about as analogous to real folders.

Folders

The physical constructs behind the abstraction are not something the average user needs to be concerned with. In fact, the folders can, and do in this case, represent quite different things. Some of the folders in the picture are things that are kept on my local hard drive, but some are network connections. Since they, from a certain aspect, all support the same set of operations, we can interact with them using the simplified view that they are a homogeneous set of objects. That is part of the model of this file browser interface.

Computers operate through level upon level of models, starting (unless you want to get metaphysical) with voltages in electrical circuits, progressing to boolean logic, machine language, and so on, and perhaps via one or several virtual machines, up to, finally, the model that the user of the program understands.

Intuitive Models

Models were not invented for computing. Everybody thinks intuitively in levels of abstraction. Generally, people have no problem accepting that if you look under the surface of the simplified concepts we accept as objects or concepts – things of various kinds – you find a more fine-grained physical level of existence, which is quite different from what we usually perceive.

But strangely enough, some people, computing professionals in fact, sometimes fail to understand that the model can be independent of the lower physical level. I have suffered from this myself, and perhaps I still do in some areas, but I have gotten a lot better over the last few years.

Programming Languages

When I have told people that I develop high-performance systems in Java, I have often (less now than a few years ago) had the surprised reaction “but Java is so slow, how can you get it fast enough?”

Java is a language which, like other programming languages, expresses sequences of logical (in a wider sense) operations. How could a language possibly be slow, or fast for that matter? Consider the following Java code that calculates a number in the Fibonacci sequence:
    // iterative Fibonacci: after the loop, s holds the n-th number
    long s = 0, t = 1;
    for (int i = 0; i < n; ++i) {
        long u = s + t;
        s = t;
        t = u;
    }

Now, take a look at the equivalent code in C++:
    // iterative Fibonacci: after the loop, s holds the n-th number
    long s = 0, t = 1;
    for (int i = 0; i < n; ++i) {
        long u = s + t;
        s = t;
        t = u;
    }

Which one is faster? How can any of them be faster? They are exactly the same! (Java and C++ are closely related languages with the same basic syntactic elements.)

Now, perhaps you think I am just being difficult. Obviously, what people mean when they say that Java is slower than C++ is not that the language per se is slow. It is that code compiled with a Java compiler and executed in a Java runtime environment is slower than the same code compiled by a C++ compiler and run by the operating system directly. This is because the Java runtime environment is a virtual machine – an interpreted system. It is a program that looks at one instruction at a time and changes its state accordingly, just like the physical machine that the operating system is running on. This extra level adds overhead.
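
For readers who have never seen one, the core of an interpreter is roughly a loop like the following toy example, a made-up machine with just two instructions. The fetch-and-dispatch work on every single instruction is the overhead in question.

    // A toy interpreter core: fetch an instruction, branch on it, update state.
    // The dispatch work on every single instruction is where purely
    // interpreted virtual machines lose their time.
    public class ToyVM {
        static final int PUSH = 0, ADD = 1; // a made-up two-instruction machine

        static int run(int[] code) {
            int[] stack = new int[16];
            int sp = 0;
            for (int pc = 0; pc < code.length; ) {
                switch (code[pc++]) {         // fetch and decode
                    case PUSH: stack[sp++] = code[pc++]; break;
                    case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
                }
            }
            return stack[sp - 1];
        }

        public static void main(String[] args) {
            // PUSH 2, PUSH 3, ADD evaluates to 5
            System.out.println(run(new int[] { PUSH, 2, PUSH, 3, ADD }));
        }
    }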

Is this true, then?

It certainly was true – about ten years ago. But then just-in-time compilation came into use. Virtual machines with this feature did not execute instructions directly, but compiled them to machine code the first time they saw them, and fed the machine code into the physical machine. This could get almost as fast as compiling directly, but it still had the overhead of compiling at runtime.

Finally, the HotSpot virtual machine arrived. It did not just compile everything it saw. It gathered statistics about which parts of the code were run most often, and made a greater effort to produce efficient code for the more frequent parts.

About five years ago, when I got into a discussion with someone who claimed that Java was inevitably slower than C++ for tight loops, I did a comparison between Java and C++ execution, on tight-loop code similar to the Fibonacci calculation code above. The Java code executed in the HotSpot virtual machine was significantly faster than the C++ code compiled by GCC with full optimization and run directly on the physical machine.
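
A comparison of that kind looks roughly like the following sketch on the Java side (this is not the original test code). The warm-up pass matters: HotSpot only produces optimized machine code for code it has observed running hot, so timing cold code would measure the interpreter instead.

    // Rough sketch of a tight-loop timing harness (not the original test).
    // The warm-up pass matters: HotSpot only emits optimized machine code
    // for code it has observed running hot, so timing cold code would
    // measure the interpreter instead.
    public class FibBench {
        static long fib(int n) {
            long s = 0, t = 1;
            for (int i = 0; i < n; ++i) {
                long u = s + t;
                s = t;
                t = u;
            }
            return s;
        }

        public static void main(String[] args) {
            long sink = 0;
            for (int i = 0; i < 10000; i++) sink += fib(90); // warm-up
            long start = System.nanoTime();
            for (int i = 0; i < 1000000; i++) sink += fib(90);
            System.out.println((System.nanoTime() - start) / 1e6 + " ms");
            System.out.println(sink); // use the result so the loop is not dead code
        }
    }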

The explanation is that the Java environment, doing the compiling in runtime, has more information at its disposal than the C++ compiler. Hence, it is able to generate more efficient machine code.

This was (and still is, I imagine) astonishing to a lot of people, stuck with the interpretation overhead stigma.

The point of all this is that the language – the model, so to speak – does not determine whether executing code is efficient or not. The language is just a way of expressing transitions from one machine state to the next. What ultimately should matter for efficiency is the computational complexity of computing the next state. There is, in theory, absolutely no reason why code expressed in one model should be less efficient than in another. If the system is smart enough, it can execute it in the most efficient way anyway. This is the true meaning of the word optimization.

What is called optimization in practice is not really optimization, because it rarely finds the optimal code. It just finds code that is, hopefully, better than a naive translation from high level language to machine code. I suppose true optimization in general is impossible. (I have not found an actual proof, but it feels like it would be related to computing Kolmogorov complexity.) However, in every case where the system can recognize a piece of code that expresses something it has an efficient execution method for, it can use that method.

The evolution of Java systems is an excellent example of how an improved implementation can increase performance without changing the model.

Next Time: Data Models

The main point that I want to make about misunderstood models has, as you might expect (and as promised at the end of the previous post), to do with database systems. But this post is already quite long, and I do not want to cram in the most important subject at the end. You may already have guessed the essence of it, but for the details I again have to refer you to a future post.

Monday, October 20, 2008

Text Search and the Relational Model

by Jesper Larsson

About seven years ago, I gave a presentation to a large group of potential customers about the text search product that was our big thing at the time. I explained that our storage was more compact than in the relational model. I was so wrong.

Now I know better. Unfortunately, most of the industry and much of academia does not. People get away with saying the things that I did without being laughed at, just as easily as ever.

Not understanding what the relational model really is, people take it to be what the major relational database systems do. Hence, when people cannot get what they need from those systems, they conclude that the relational model is not suitable for their needs. So, they throw away one of the most powerful abstractions for data processing without even looking at it.

It is particularly common to write off the relational model in the text search field. You can frequently hear people say things like “the relational model is not suitable for capturing the structure of text”, or hear them refer to text by the peculiar term “unstructured data”. Some writers talk about combining text and relational data, as if there were a contradiction. As if relational data were a special kind of data. A more correct account, like that of Curt Monash, is to simply note that text does not work very well with the architecture of mainstream relational products – which is true. Text search applications based on major database systems usually turn out painfully inefficient. One of many points where the architecture of those systems does not match the demands of the Internet world. Database system vendors are unable to adapt, which has created a separate market for search engines.

But that avoids the core issue. Since text search is one of my top areas of expertise, I hope I can explain to you why the relational model is perfectly capable of capturing the structure of text. I'll start at the very bottom, explaining what text search really is.

Searching for Words

Boiled down to the most fundamental formulation, the text search problem is to answer the query “which documents match the text I give you?” The result is a set of document identifiers of some kind – titles, file names, internet addresses or something, depending on the application. Let us assume, for simplicity, that each document has a unique number that you use for identifying it.

Then you might have, for example, a database such as this one:

Text Doc. no
war and peace 1
art of war 2
modern art 3

The text data here may not be something that you would normally consider a document, but that is just because I want to keep the example small. The same principles apply if the texts are hundreds, thousands, or millions of words long, and if there are millions of documents.

Say that you want to search for a single word, war for instance. Scanning through all the text of all the documents would probably not be efficient enough for a real-life database, so you need some sort of data structure to make searching a quicker task.

A method that you are sure to be familiar with, even if you know nothing about computing, is to line up items alphabetically to allow searching for something without looking through all the data. A computer program can use binary search to locate something quickly in an ordered list, or it can build a search tree that allows the same sequences of comparisons without having to physically align the sorted items. (One out of a bunch of standard search methods at the programmer's disposal.)

When you create an index for a data field (a table column) in a common relational database system, this is essentially what happens. Most commonly, some sort of B-tree is created, and subsequent requests for records based on that field can use the tree to speed up finding the right records.

So, an index for a text field can be visualized as another table, where the rows are ordered alphabetically:

Text Doc. no
art of war 2
modern art 3
war and peace 1

This lets us find document no 1 quickly by looking for war in the ordered text field. But what about document 2? It contains war as well, but the index is no help since the word is not at the beginning.
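
The limitation is easy to reproduce with any ordered dictionary. Here is a small Java sketch with a sorted map standing in for the B-tree index:

    import java.util.*;

    // A sorted map standing in for a B-tree index on the text field.
    // Prefix lookups are cheap, but a word in the middle of the text
    // ("war" inside "art of war") is invisible to the ordered index.
    public class PrefixIndex {
        public static void main(String[] args) {
            TreeMap<String, Integer> index = new TreeMap<>();
            index.put("art of war", 2);
            index.put("modern art", 3);
            index.put("war and peace", 1);

            // All texts starting with "war": finds document 1 only.
            System.out.println(index.subMap("war", "war\uffff"));
            // Document 2 also contains "war", but no prefix lookup will find it.
        }
    }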

This is where common database systems give up and tell you to stop using the relational model. Instead, they supply you with a special kind of text index, that can be invoked either with a special operator to specify that a record contains a word, or using a hierarchical programming interface.

You may find it a bit surprising that they give up so easily, because it should be obvious to almost anyone, including major database vendors, that the standard text search practice of inverting the data can be applied like this:

Word Doc. no Position
and 1 2
art 2 1
art 3 2
modern 3 1
of 2 2
peace 1 3
war 1 1
war 2 3

This table tells us, for each word, in what position in what document the word can be found. In fact, it conveys exactly the same information as the original table. The difference is that with this organization, it is simple to express queries such as “which documents contain the word war” using just a normal relational language. No special operators, no non-relational interface.

In addition, unlike a specialized text index, it allows you to express a bunch of other queries about the words and the documents. This is one of the strengths of the relational model: it uses a small, fixed, set of operators, but still lets you formulate practically any query that you can think of.

Say that you want to find which documents contain both war and art. In a relational language, this is easily expressed as the intersection between the documents containing war and those containing art. (There is actually also a more intriguing way to express it that I will get to that in a future post.) By contrast, if your word query construct is a special contains operator, you need to add support for a more complex argument than just a word. For instance, you could say something like “contains(war and art).” The argument string, “war and art”, is something that the implementation of the contains operator has to parse and make its own kind of sense of. In effect, this means that you have to have a whole separate query language for the contains argument alone!
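
To make the contrast concrete, here is a minimal sketch in Java of the inverted organization and the intersection query; the data is the example above, minus the position column, which the AND query does not need.

    import java.util.*;

    // The inverted organization as a map from word to the set of documents
    // containing it. "war AND art" is then a plain set intersection; no
    // special contains() mini-language is needed.
    public class WordQuery {
        public static void main(String[] args) {
            Map<String, Set<Integer>> byWord = new HashMap<>();
            byWord.put("war", new TreeSet<>(Arrays.asList(1, 2)));
            byWord.put("art", new TreeSet<>(Arrays.asList(2, 3)));
            byWord.put("peace", new TreeSet<>(Arrays.asList(1)));
            byWord.put("modern", new TreeSet<>(Arrays.asList(3)));

            Set<Integer> result = new TreeSet<>(byWord.get("war"));
            result.retainAll(byWord.get("art")); // intersection
            System.out.println(result); // prints [2]: "art of war" has both
        }
    }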

This should really be enough to convince anyone that the relational model is suitable for capturing the structure of text.

To be fair, we should note that there is no way to obtain the word-oriented organization from the document-oriented one using standard relational syntax. This is because it depends on a particular interpretation of the text field in the original table – that it consists of space-delimited words. So you either have to present the system with the word-oriented data as input, or include some way of specifying how the text field is to be broken down into multiple word values.

Model vs Implementation

Indexing by inverting text data is the basis of pretty much all information retrieval. It is how the text indexes of the database products work below the surface, and it is the underlying method for most search engines you find on the Internet – large and small. (Although there is a rumor, supported by a peculiar change of behavior, that Google recently changed their basic search method to something completely different.)

So why don't the major relational database systems expose their text indexes in a relational way? The answer lies in the core problem shared by all those systems, as far back as System R: the lack of separation between that which is logical and that which is physical, between model and implementation.

That is a subject that deserves to be treated as something more than a side note. It will be the subject of an upcoming post.

Friday, October 3, 2008

Why the Database Masters Fail Us

by Jesper Larsson

There is hardly any field in computing more plagued by religious wars than database systems. For nearly 40 years, the battle has raged between various architectures, occasionally with some combatants replaced – or at least renamed. Still, it seems that we are further away than ever from a sort of database that we can be satisfied with. (I am not going into the details of the problems right now; that will be a subject of future posts. But if you are in the business I am sure you are familiar with some of them.)

Let us take a couple of steps back and take a look at the situation. Let us forget, for the moment, our personal stance in the fight on data models, platforms etc., and ask ourselves: who do we depend on to design database systems? What is driving them? Could they do better? Could we do better, with a different kind of effort?

I have come up with three groups of operators that influence the design of database platforms. I call them the paper writers, the evangelists, and the merchants. Anyone who creates the actual code that makes up the database systems is a servant of one or more of these masters. Let us go over them one by one.

The Paper Writers

What would be a better source of knowledge to base our system design on than science? According to popular view, scientists, or more specifically academic researchers, have the task of advancing human knowledge. Using objective observation and rational reasoning, they find the ultimate truth, unaffected by fads or short-sighted economics.

Unfortunately, it does not quite work that way.

The primary concern of most academic researchers is to produce papers – articles published in conference proceedings or scientific journals. Published papers are what they are judged by, the most important merit for their Ph.D. degrees and research grants.

Consequently, academic researchers learn to become experts on getting papers published. They have an ambition to advance human knowledge too, but that is a far-fetched goal that only established stars can afford to have as their first priority. In daily work, the immediate focus of most researchers is to impress peer reviewers – people in their own field who decide whether papers get accepted or not.

For several reasons, this makes scientific work less useful for practitioners than you might expect.

First, it influences which subjects get explored. Researchers tend to pick subjects that are currently in fashion, for which papers are in demand. The result is that current trends have a large impact on what subjects people choose to work with.

Second, it has an effect on the language of published research. Since writers and reviewers are actually the same people – the researchers involved with the field – writers use language meant to be understood and judged as appropriate by other people like themselves. This has a self-amplifying effect, with the result that publications often seem impenetrable or irrelevant to people outside the field.

Third, once the papers are published, the work is finished as far as the researcher is concerned. Few researchers bother to take their findings any further.

The consequence is that academic research rarely gives us comprehensible knowledge of how to design a system. What we mostly get are thousands of fragments of potentially useful knowledge, clumped together in bursts around subjects that are popular over a few years, and usually presented in a language that is difficult to penetrate.

Many research projects include implementation of actual software systems to test or demonstrate research findings. Sometimes they develop into industrially useful ones, but only rarely. Research systems are typically not fully functional or efficient enough for general practical use, and it would hardly be fair to expect them to be, especially for software as large and complex as modern database systems are. After all, the point of academic research is not to produce ready-made systems.

The Evangelists

There are a number of people out there who claim that they have the correct view of how a database should be constructed, and that essentially everyone else is wrong. If people would just listen to them, everything would turn out fine. Of course, oddballs, fanatics, and charlatans exist in every field, but none of these labels quite captures the database evangelists – at least not all of them.

In particular there is a group of people who persistently promote the relational model. But, the relational model is already dominant in the database world, isn't it? Is not everything fine, then? No, this true relational model lobby, with esteemed relational database pioneer C.J. Date as their figurehead, claims that the version of relational databases that dominates the industry is distorted; that the relational model is misunderstood or mistreated by practically everyone, even to some extent its inventor E.F. Codd!

I may have made this sound a little more eccentric than it deserves. The fact is, I principally agree with most of what Date and his allies say. However, even if they are correct about the data model, they do not have all the solutions needed to create a full database system. In their fervor to promote the true relational model, they de-emphasize a number of problems, or do not address them at all. Hardware utilization and efficiency, for instance, they write off as somebody else's problem.

On the other hand, an extreme pragmatic lobby has recently emerged, centered around Michael Stonebraker, another veteran of the database field. Although Stonebraker's thesis that it's time for a complete rewrite of database products could plausibly be supported by C.J. Date, Stonebraker could not care less about purifying the relational model. He currently endorses abandoning the very idea of general-purpose database systems for specialized, application-specific solutions.

The database evangelists have an influence through their writings as well as through their contacts with implementation projects in academia or in the industry. Most of the time, their direct impact is minor, but they may have important roles as architects of future systems.

The Merchants

The vendors that make their living producing database systems obviously want to sell their products or services to as many customers as possible. This makes them keenly monitor what the market seems to want, and declare that this is just what they have – sometimes adjusting their products accordingly.

On the one hand, merchants tend to be conservative, at least on the main issues. They are frightened by radical new ideas that threaten to be costly both to them and to their customers. Database management platforms are heavy components in most IT infrastructures, coupled with large investments and legacy issues. Rather than improving their systems at the core, merchants prefer to add peripheral components, covering more and more of customers' software needs.

On the other hand, the merchants must be extremely sensitive to trends. They have to keep up with the latest buzzwords, not to appear to be falling behind the competition. Also, they are always on the lookout for new features – minor extensions that they can use as selling points.

The result, when merchants get their way, is that once their product has established a decent market share, it more or less stops developing, and starts growing instead.

Who Will Create the Perfect Database System?

As you can see, if you roughly agree with my outline of the operators, nobody really has both the will and the capability to create a good database management system. How is this different from other areas of the software industry? Simply because a database platform is such a monumental piece of software, not only to produce, but also to use. Changing your database system can be more demanding than replacing your operating system. To both create and sell a major product is a gargantuan task.

It is unlikely for any company or academic institute to successfully take on the task of designing and producing from scratch, and then selling to the market, a database system that is different enough to take a major leap forward. It can happen – it has happened before – but the way the industry has developed, it has become a lot more difficult since the last major shift with the relational database breakthrough in the early 1980s.

More likely, new systems will evolve slowly. People will abandon the monolithic one-size-fits-all database systems, as Stonebraker predicts, and create specialized lightweight solutions for various applications. This has already happened in some areas.

However, I am convinced that there is a place for general data management platforms in the future. It is simply too much work for everyone to roll their own.

So What am I Selling?

Who am I to say all this then, and what is my interest in the business? I am head of research at a medium-sized Scandinavian company named Apptus Technologies, which started out offering services around existing major database systems. We suffered from the sluggishness of the platforms, adapted to it, and ultimately made our business from it.

For a number of years we have produced search systems for large and complex data sets, to be accessed over the internet by millions of users. We achieved efficiency and flexibility by gradually moving away from major database platforms. Ultimately, we created our own full-fledged database management system. We have never sold it as a standalone system (nor do we intend to in the foreseeable future), only used it as a base for more specialized systems. Therefore, we have enjoyed a freedom in choosing how to develop our platform that database system vendors do not have – including the freedom to change our minds.

We have learned a lot during the last eight years, gained a lot of insights, and developed some strong opinions in the process. Now, we have decided to stick our heads out of the laboratory a bit, and try to start a conversation with the world outside. Not just about the advantages that our technology can produce for our customers (we have a sales department for that), but about the technology and ideas behind it, as well as our visions for the future of information systems. It is going to be a lot about databases, but also about programming and computer science in general.

I hope you will join us. Stay tuned to this blog. Subjects coming up: why you can benefit from rewriting your system rather than patching it; how we started out misunderstanding relational databases, and why most people still do.