The Reification of Data: from the Printing Press to the Web-centric API

21 Feb 2011 // science

In the writings on the history of science, nothing comes as close to explaining its origins as one stunning chapter in Elizabeth Eisenstein's The Printing Revolution in Early Modern Europe. There Eisenstein argued that the invention of printing created whole new categories of data transmission and opened up new realms of thought. In that explosion of mental space, modern science was born.

I'm beginning to think that something equally earth-shattering is happening right now due to a certain piece of technology. And it isn't the transmission of articles through PDF or the electronic transmission of terabytes of data – I think these kinds of technologies are only logical extensions of what has come before. The technology that is fundamentally changing the nature of scientific data lets data live in a way that it never has before, and will open up new ways of interacting with it. It is the web-centric API, or Application Programming Interface.

Comparing Cheap & Accurate Books

But first, let's go back to the printing press. How exactly did it change thinking about the natural world? It is hard to overstate the difficulty of obtaining books before the printing press. They were expensive because they were so hard to manufacture. Because they were hard to manufacture, they were also exceedingly rare. As each copy of a book had to be laboriously copied out by hand, there were precious few copies of each book. If you wanted to read a certain book, you would often have to travel very far to the library where the book was held, sometimes across continents. And that's assuming you were known to the owner of the book.

But above all, before the printing press, books were just plain inaccurate. Books copied out by hand tended to be riddled with errors. Diagrams, figures and drawings varied enormously depending on the drafting skills and artistic temperament of the copier.

Eisenstein argued that modern astronomy – the first of the sciences to be mathematicized – was born precisely when the printing press allowed it to be born. She argued that Galileo was lucky to be alive when the explosion of printed books allowed him to make the momentous step to the modern theory of planetary orbits. The evidence: as a young man, Galileo owned one tattered and badly translated hand-copied version of Ptolemy's "Almagest". By the end of his life, Galileo had not one but three different translations of the Almagest, printed in high quality and very accurately reproduced.

The invention of printing allowed scholars, for the first time in Western history, to study at their leisure all the ancient texts as well as modern ones. This made it practical for them to compare and contrast, to look for inconsistencies, errors and elisions. Copernicus was the first astronomer in history to have in his hands all the great works of astronomy, from Euclid to the Arabs, in translation.

Figures get Stabilized

Not only did printing allow a Renaissance scholar to compare and contrast ancient theories, but the very notion of a reproducible figure was invented, allowing an unprecedented expansion of visual communication. Before the printing press, the reproduction of figures was a haphazard process where accuracy was not thought possible.

To illustrate the impact of printed figures, Eisenstein examines the development of cartography. Before the printing press, maps were quirky, unique artifacts, difficult to manufacture and impossible to copy accurately. One could only rely on maps drawn by mapmakers who had direct experience of the areas they had navigated. Maps drawn by different navigators were widely different. However, since maps were so hard to reproduce before the printing press, few people would have had the opportunity to see these differences, let alone resolve them through careful comparison.

But once the printing press came along, the notion of a standard map quickly arose. Different maps could be reproduced easily and accurately. More importantly, with the proliferation of cheap, accurate maps, many people could spot inconsistencies and errors across different maps. The notion of a standard map arose with this mass movement of comparison.

The invention of the printing press allowed the creation and dissemination of standard figures, diagrams and instructions. This accelerated the construction of instruments, the explanation of detailed experiments, and the creation of disciplines that rely on complex illustration, such as modern anatomy through the publication of Vesalius's "On the Fabric of the Human Body".

Tables of Data set Loose upon the World

Ian Hacking, that great Canadian philosopher of science, made the point that it is no accident that physics was the first modern science. Physics grew from astronomy, and the measurements needed for astronomy – the positions of the planets and the stars – were simply lying there in the sky for anyone to measure.

You need these numbers in great quantity for mathematicians to puzzle out the mathematical order hidden in them. Kepler derived his three laws of planetary motion from the treasure trove of Tycho Brahe's measurements of the planetary orbits. Such data could only be gathered by someone with the obsessional qualities – and the wealth – of a Tycho Brahe, who built observatories in several parts of the world and spent decades compiling his measurements.

For Brahe to have the impact he did, he needed a mechanism to disseminate his observations to scientists around the world. Luckily for him, he had access to the new-fangled technology of the printing press to print the books containing his measurements. These books would make him famous. Indeed, Brahe himself recognized the centrality of the printing press. In the symbolic plate at the beginning of his book, he draws a printing press as one of the pillars of his scientific endeavors.

Without the printing press, disseminating such a mountain of data would have been nigh on impossible. If you were a student of astronomy and wanted to study Brahe's measurements, you would have had to get access to Brahe's observatory to study his hand-written tables. If you wanted to take this data home to study at leisure, you would have had to copy it out by hand. This would have taken days if not weeks, all whilst living on the hospitality of the rather prickly Brahe, and your copy would have included hundreds if not thousands of errors.

Brahe's tables of planetary data are only the most famous example of data made possible by the printing press. Tables of numerical functions vastly improved the calculating abilities of the early scientists. If you wander through old physics libraries, you'll probably stumble on some musty books of tables of numerical functions, lost somewhere in the back shelves. Not every mathematician or physicist was a calculating juggernaut such as Gauss or Newton. For the journeyman scientist, books of pre-calculated mathematical functions were absolutely necessary to solve their hairy equations. The values of sines, cosines and logs, neatly typeset in tabulated form, made it possible to solve equations simply by looking things up. For most of the history of physics, such books of numbers were an indispensable calculation tool.

These books of tables of numbers simply did not exist before the printing press. It was inconceivable that any publishing house would have contemplated the manufacture of such books. Not only would it have been mind-numbingly boring to copy such books by hand, but they would have been notoriously unreliable. Only with the invention of the printing press was it possible to produce such books in a cheap and accurate manner.

Copernicus himself saw, in his lifetime, the publication of high-quality books of mathematical tables. He gained access to precomputed tables of sines and cosines, necessary tools of applied mathematics. It is entirely possible that Copernicus would not have been able to make the calculations necessary for his planetary model without such books.

Accessing Big Data in the Age of the Web

Here in the 21st century, we now have a plethora of technologies that extend the practice of science from electronic publishing to terabyte hard drives. Yet conceptually, these kinds of technologies are not fundamentally different to Tycho Brahe's tables of planetary observations. Granted, we can do it larger, faster, and even instantaneously, but the essential relationship with the data remains the same. The data we get is but a static copy of the data from a slice of time in the past.

But data that sits on the web works differently. Wikipedia is perhaps the best example. As the information is there in one place, anyone can edit and change it. And these changes propagate instantaneously to anyone accessing it. The changes are transparent to the consumer of information, so much so that a Wikipedia page feels like a living document.

This change in the relation to data first occurred to me a few years ago, whilst working with protein data. For structural biologists, the most common protein data is stored at the Protein Data Bank (PDB). In the old days, you would order a CD-ROM, or FTP a large zipped copy of a particular release. Then you would write your programs and analyze the proteins in your local copy of the PDB. As time passed, you would order another CD-ROM or download the whole library again.

But then the PDB website got much better, as the curators of the PDB adopted many Web 2.0 idioms. It was important that the protein community had come together to beatify the PDB website as the one-and-only destination for protein data. When the data is centralized, curation can happen in one place. Because everyone reads from that single source, you can be sure that corrections and improvements to the data will propagate to all users. Finally, they rewrote the website to offer a decent Application Programming Interface (API), meaning that you can now access the data through any program on any computer, as long as you have an internet connection.

I ended up dumping my local copy of the PDB and rewriting my programs to pull the data off the website as I needed it, through the API of the PDB website. I knew that this was the best way to get fresh data. Apart from the incredible convenience of only getting the data I needed, using a web-centric API means that I access the most up-to-date information without any special effort at all.
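To give a feel for what pulling the data on demand looks like, here is a minimal Python sketch using only the standard library. It is an illustration, not the code I actually ran. The function names are made up for this example, and the download address is an assumption about how the PDB serves individual entries, so check the site's current documentation before relying on it.

```python
from urllib.request import urlopen

# Assumed base address for per-entry downloads (an assumption, not taken
# from the article); confirm against the PDB site's own documentation.
PDB_DOWNLOAD_BASE = "https://files.rcsb.org/download"


def pdb_entry_url(pdb_id: str) -> str:
    """Build the download URL for one entry, given its 4-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB identifier: {pdb_id!r}")
    return f"{PDB_DOWNLOAD_BASE}/{pdb_id}.pdb"


def fetch_pdb_entry(pdb_id: str, timeout: float = 30.0) -> str:
    """Fetch the current text of one entry straight from the website."""
    with urlopen(pdb_entry_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


def count_atoms(pdb_text: str) -> int:
    """Count ATOM records in PDB-format text (a toy stand-in for real analysis)."""
    return sum(1 for line in pdb_text.splitlines() if line.startswith("ATOM"))


if __name__ == "__main__":
    # Needs an internet connection; 1UBQ (ubiquitin) is just an example ID.
    text = fetch_pdb_entry("1ubq")
    print(count_atoms(text), "ATOM records in the current version of 1UBQ")
```

The point is that nothing is stored locally: every run asks the central copy for whatever the entry looks like today, so any correction made by the curators shows up the next time the program runs.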

The Second Reification of Data

The possibility of science arose from the technology of the printing press during the tail end of the Renaissance. With that magic combination of machine and ink, cheap and accurate copies of text, numbers and figures were made possible. You could say that data was reified for the first time in history. As data became a commodity, the early scientists were able to theorize about nature far beyond their predecessors.

Today, we are on the cusp of an even more intense relationship with data. When data is stored in a central hub on the web, improvements are fed in by researchers who create fresh data, and errors are repaired by researchers who search for flaws. The data attains new heights of integrity. The users of the data can then consume the data at their leisure, safe in the knowledge that they are feeding on the one true copy.

I envisage a near future where all modern scientific data coagulates at critical points on the internet. The essential data for every sub-discipline will be reified at key websites, where the data is both universal and simultaneous. This second reification of data means that data can grow and heal on its own, in an entirely transparent manner, where for all intents and purposes, the data is now alive, a perfect organism in time and space.