Advance of the Data Civilization: A Timeline

The precursors of what we’re trying to do with computable data in Wolfram|Alpha in many ways stretch back to the very dawn of human history—and in fact their development has been fascinatingly tied to the whole progress of civilization.

Last year we invited the leaders of today’s great data repositories to our Wolfram Data Summit—and as a conversation piece we assembled a timeline of the historical development of systematic data and computable knowledge.

This year, as we approach the Wolfram Data Summit 2011, we’ve taken the comments and suggestions we got, and we’re making available a five-foot-long (1.5-meter) printed poster of the timeline—as well as putting the basic content on the web.

Historical data timeline

The story the timeline tells is a fascinating one: of how, in a multitude of steps, our civilization has systematized more and more areas of knowledge—collected the data associated with them, and gradually made them amenable to automation.

The usual telling of history makes scant mention of most of these developments—though so many of them are so obvious in our lives today. Weights and measures. The calendar. Alphabetical lists. Plots of data. Dictionaries. Maps. Music notation. Stock charts. Timetables. Public records. ZIP codes. Weather reports. All the things that help us describe and organize our world.

Historically, each one required an idea, and had an origin. Most often, what was happening was that some aspect of the world was effectively getting bigger—and one organization or one person took the lead in introducing a method of systematization.

Sometimes those involved were powerful or famous. But quite often they were in a sense in a back room, just solving a practical problem—usually modestly at first. Yet in time the perhaps arbitrary schemes they invented gradually spread as the need for them increased.

Most people will have heard of Euclid, who defined a way to systematize mathematics, or of Julius Caesar, who standardized the months of the year. Fewer will have heard of Guido d’Arezzo, who in 1030 AD invented stave notation for music. Or Robert Cawdrey, who in 1604 made what was probably the first alphabetical dictionary. Or Munehisa Homma, who in 1755 made what was probably the first market price chart. Or George Bradshaw, who in 1839 made the first train timetable. Or Malcolm Dyson, who in 1946 invented the standard IUPAC notation for naming chemicals.

As one looks at the whole timeline, one can see several definite classes of innovations.

One class consists of schemes for describing or representing things. Like latitude/longitude (invented by Eratosthenes around 200 BC). Or the notation for algebra (from Franciscus Vieta around 1595). Or binomial species names (invented by Carl Linnaeus around 1750). Or geological periods (introduced around 1830). Or citations for legal cases (from Frank Shepard in 1873). Or CIE color space (from 1931). Or SI units (from 1954). Or ASCII code (from 1963). Or DNS for internet addresses (from 1983).

Another class of innovations consists of schemes or repositories for collecting knowledge about things. Like Babylonian land records (from 3000 BC). Or the Library at Thebes (from 1250 BC). Or Ptolemy’s star catalog (from 150 AD). Or the Yongle Encyclopedia (from 1403). Or the US Census (from 1790). Or Who’s Who (from 1849). Or weather charts (from Robert FitzRoy in 1860). Or the Oxford English Dictionary (from the 1880s). Or the “Yellow Pages” (from Reuben H. Donnelley in 1886). Or Chemical Abstracts (from 1907). Or baseball statistics (from Al Elias in 1913). Or Gallup polls (from 1935). Or GenBank (from 1982).

Another class of innovations is more abstract: in effect formalisms for handling knowledge. Like arithmetic (from 20,000 BC). Or formal grammar (from Panini around 400 BC). Or logic (from Aristotle around 350 BC). Or demographic statistics (notably from John Graunt in 1662). Or calculus (from Isaac Newton and Gottfried Leibniz around 1687). Or flow charts (from Frank & Lillian “Cheaper by the Dozen” Gilbreth in 1921). Or computer languages (from around 1957). Or geographic information systems (from Roger Tomlinson in 1962). Or relational databases (from the 1970s).

And then, of course, there is the curious history of attempts to do things like what Wolfram|Alpha does. I suppose Aristotle was already thinking of something similar around 350 BC, as he tried to classify objects in the world, and use logic to formalize reasoning. And then in the 1680s there was Gottfried Leibniz, who very explicitly wanted to convert all human questions to a universal symbolic language, and use a logic-based machine to get answers—with knowledge ultimately coming from libraries he hoped to assemble.

Needless to say, both Aristotle and Leibniz lived far too early to make these things work. But occasionally the ideas reemerged. And for example starting around 1910 Paul Otlet and Henri La Fontaine actually collected 12 million index cards of information for their Mundaneum, with the idea of operating a telegraph-based world question-answering center.

In 1937 H. G. Wells presented his vision for a “world brain”, and in 1945 Vannevar Bush described his “memex”, that would give computerized access to the world’s knowledge. And by the 1950s and 1960s, it began to be taken almost for granted that knowledge would someday become computable—as portrayed in movies like Desk Set or 2001: A Space Odyssey, or in television shows like Star Trek.

The assumption, however, was that the key innovation would be “artificial intelligence”—an automation of human intelligence. And as the years went by, and artificial intelligence languished, so too did progress in making knowledge broadly computable.

As I’ve talked about elsewhere, my own key realization—that arose from my basic research in A New Kind of Science—is that there can’t ever ultimately be anything special about intelligence: it’s all just computation. But where should the raw material for that computation come from? The point is that it does not have to be learned, as a human would, through some incremental process of education. Rather, we can just start from the whole corpus of systematic knowledge and data—as well as methods and models and algorithms—that our civilization has accumulated, poured wholesale into our computational system.

And this is what we have done with Wolfram|Alpha: in effect making immediate direct use of the whole rich history portrayed in the timeline.

I should say that as a person interested in the history of ideas, the actual process of assembling the timeline was a quite fascinating one. We started by looking at all the different areas of knowledge that we cover in Wolfram|Alpha—or hope to cover. Then in effect we worked backward, trying to find the earliest historical antecedents that defined each area.

Sometimes most of us knew these antecedents. But quite often we were surprised by how long ago—or how recent—those antecedents actually were. And in some cases we had to ask a whole string of experts before we were confident that we had the right story.

Each entry on the timeline was written separately—and I was most curious to see what would emerge when the whole timeline was put together. Of course, there is considerable arbitrariness to what actually appears on the timeline, and inevitably it’s prejudiced toward more recent developments, not least because these do not have to have survived as long to seem important today.

But when I first looked at the completed timeline, the first thing that struck me was how much two entities stood out in their contributions: ancient Babylon, and the United States government. For Babylon—as the first great civilization—brought us such things as the first known census, standardized measures, the calendar, land registration, codes of laws and the first known mathematical tables. In the United States, perhaps it was the spirit of building a country from scratch, or perhaps the notion of “government for the people”, but starting as early as 1785 (with the passage of the Land Ordinance), the US government began an impressive series of firsts in systematic data collection.

Given the timeline, a very obvious question is: how are all these events distributed in time, and space?

Here’s a plot showing the number of events per decade and per century:

Plot showing the number of events per decade and per century

And here’s a cumulative version of the same information:

Cumulative version of plot showing the number of events per century
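For readers who want to try this kind of tally themselves, here is a minimal Python sketch of how per-century counts and a running total can be computed. It is only an illustration: the list contains just the dated examples named earlier in this post (not the full timeline data), BC years are written as negative numbers (a simplification that ignores the missing year zero), and the function names are my own for this example.

```python
from collections import Counter
from itertools import accumulate

# A small sample: only the dated examples mentioned in this post.
# Negative values stand for BC. This is NOT the full timeline data set.
SAMPLE_YEARS = [
    -3000, -1250, -400, -350, -200, 150, 1030, 1403, 1595, 1604,
    1662, 1687, 1750, 1755, 1790, 1830, 1839, 1849, 1860, 1873,
    1886, 1907, 1913, 1921, 1931, 1935, 1946, 1954, 1957, 1962,
    1963, 1982, 1983,
]


def bin_start(year, width):
    """Return the first year of the `width`-year bin that contains `year`."""
    return (year // width) * width


def counts_per_bin(years, width):
    """Count events in each bin, returned as a dict sorted by bin start."""
    counts = Counter(bin_start(y, width) for y in years)
    return dict(sorted(counts.items()))


def cumulative(counts):
    """Running total of events up to and including each bin."""
    return dict(zip(counts, accumulate(counts.values())))


per_century = counts_per_bin(SAMPLE_YEARS, 100)
per_decade = counts_per_bin(SAMPLE_YEARS, 10)
running = cumulative(per_century)

for start, total in running.items():
    print(f"{start:>6}: {per_century[start]:>2} events, {total:>2} so far")
```

Changing the bin width from 100 to 10 gives the per-decade view; the cumulative dictionary is what a plot like the second one would be drawn from.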

In the first plot, we see a burst of activity in the golden age of Ancient Greece. And then we see more in the Renaissance, the Industrial Revolution, and the Computer Revolution. But it is notable that there is still at least some activity even in Europe in the Middle Ages.

Looking at the cumulative plot, we see the center of activity shift from Babylon to Greece around 500 BC, then to continental Europe around 1000 AD (after modest activity in the Roman Empire). Around 1600 Britain begins to take off, firmly rivaling continental Europe by the mid-1800s. The US starts to show activity before 1800, but really takes off in the early 1900s.

Here’s how the share of “events so far” evolves over time (and here’s a CDF interactive bar graph version):

Pie charts illustrating how the share of events so far evolves over time

Ancient Greece surpasses Babylon in 250 BC. Europe surpasses Greece in 1595. Britain briefly surpasses continental Europe in 1786. The US surpasses Britain in 1942, and all of Europe in 1984—and today is only 12% short of surpassing everything before it put together.

It’s notable how concentrated everything is in the typical “Western Civilization” countries. Perhaps this reflects our ignorance of other history, but I rather suspect it reflects instead the different interests of different cultures—and their different approaches to knowledge.

One of the most obvious features of the plots above is the rapid acceleration of entries in recent times. As I mentioned before, there’s inevitably a survival bias. But to me what’s somewhat remarkable is that nearly 20% of what’s on the timeline was already done by 1000 AD, 40% by 1800 and 60% by 1900. If one looks at the last 500 years, though, there’s a surprisingly good fit to an exponential increase, doubling every 95 years.
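To make "doubling every 95 years" concrete, here is a small Python sketch, offered as an illustration rather than the procedure actually used for the timeline. It converts a doubling time into a growth factor and a yearly rate, and estimates a doubling time from (year, cumulative count) pairs with an ordinary least-squares line through the base-2 logarithm of the counts. The function names and the synthetic data points are assumptions made up for this example.

```python
import math


def growth_factor(years, doubling_time):
    """Factor by which a quantity grows over `years` at a given doubling time."""
    return 2.0 ** (years / doubling_time)


def annual_growth_rate(doubling_time):
    """Equivalent fractional growth per year for a given doubling time."""
    return 2.0 ** (1.0 / doubling_time) - 1.0


def fit_doubling_time(points):
    """Estimate a doubling time from (year, count) pairs.

    Fits a least-squares line to log2(count) against year; the slope is
    doublings per year, so the doubling time is its reciprocal.
    """
    xs = [year for year, _ in points]
    ys = [math.log2(count) for _, count in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx / sxy


# Synthetic, made-up data that doubles exactly every 95 years,
# used only to check that the fit recovers the doubling time.
synthetic = [(1500 + 50 * k, 10 * 2 ** (50 * k / 95)) for k in range(11)]

print("fitted doubling time:", fit_doubling_time(synthetic))
print("growth over 500 years:", growth_factor(500, 95))
print("growth per year:", annual_growth_rate(95))
```

A 95-year doubling time corresponds to growth of a bit under 1% per year, which compounds to a factor of several tens over five centuries.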

Now remember, the timeline is not about technology or science, it’s about data and knowledge. When you look at the timeline, you might ask: “Where’s Einstein? Where’s Darwin? Where’s the space program?” Well, they’re not there. Because despite their importance in the history of science and technology, they’re not really part of the particular story the timeline is telling: of how systematic data and knowledge came to be the way it is in our world. And as I said above, much of this is “back room history”, not really told in today’s history books.

In Wolfram|Alpha, we also have a growing amount of information about more traditional science/technology inventions and discoveries. And the timeline for these looks a little different. There is much less activity in the Middle Ages, for example, and in the last 500 years, there is growth that rather noisily fits as exponential, with a 75 year doubling time. If anything, there are even more dramatic survival bias effects here than in the data+knowledge timeline. But if there is a significance to the difference between the timelines, perhaps it reflects the fact that the systematization of data and knowledge provides core infrastructure for the world—and grows more slowly and steadily, gradually making possible all those other innovations.

In any case, as we work on Wolfram|Alpha, it is sobering to see how long the road to where we are today has been. But it is exciting to see how much further modern technology has already made it possible for us to go. And I am proud to be a small part of such a distinguished and long history. And if nothing else, laying out the history makes a nice poster.

4 comments

  1. Thanks for the post. I find it interesting how much data organization is innovative, but increasingly not something humanity needs to internalize. Rather than understand the structure of chemical naming, or the way ZIP codes work, it’s far easier to use Wolfram|Alpha or Google to find this out. Do you see the trend of abstracting such knowledge and organization patterns into one computational or search engine continuing?

  2. What is your view on the validity of the chronological data before 1500 AD? In particular, what do you think of the statistical methods for determining the creation date of narrative texts or events mentioned therein (A. T. Fomenko)?

  3. Great compilation. But Sir, what about the period before 4000 BC and 5000 BC? You have to turn to India; that is the only place you will find more accounts of synthesized knowledge. It’s still there.

    Was it left out intentionally in your research? I guess not, but at that point in time it may have looked like more to comprehend. To be fair, I feel it’s not complete until you have the Vedic civilization. Nothing ever is.

    Erstine