Continuing the journey
If anybody doubts that history runs in cycles, they obviously don't know the technology business. From the time when we bought our first PC by mail order – for about $3000 in 1980s money (equivalent to well over $8000 today) – we’ve seen the technology scene oscillate. But each time the cycle repeats, there is a new twist that stirs up the broth. This business doesn’t get boring.
Rewinding to the big-hair ’80s, it was about “downsizing the mainframe” to distributed open systems. The center of the universe moved from the mainframe to a bunch of midsize servers flanked by rich clients. But rich clients proved too much of a good thing: Gartner came out with a study showing that the TCO of client/server was dangerously in the red because of the huge maintenance burden. Nothing comes for free. So, it was time to re-consolidate the data.
But on this go-round, there would be a new twist. Sun acquired part of Cray to build those massive Solaris UNIX servers that became the dot in dot com because, at the time, Windows couldn’t scale.
But those centralized databases on enormous Sun boxes still choked on the I/O. The iron could handle it, but not the database engine. Having been designed for the walled gardens of the enterprise, enterprise databases simply weren’t prepared to handle traffic from the unwashed masses out on the public Internet (oh, what a difference a couple of decades make). And nobody was about to turn the clock back to the proprietary systems that supported massive OLTP applications like Sabre. Enter distributed systems again, this time in the guise of n-tier. App servers and integration hubs became the de facto standard enterprise middleware, until they too were commoditized by Apache Tomcat. Hold that thought.
Y2K theoretically cleared the cobwebs, or did it? While many organizations responded to Y2K by replacing their jerry-built transaction applications with ERP (putting SAP on the map), many others took the cheap way around with windowing approaches that avoided the need to forklift-replace all those back-end systems with two-digit years. Meet the new Y2K-compliant system, same as the old system.
Around the turn of the millennium, conventional wisdom held that the relational database was the endgame for data management, while the AppDev world neatly settled into Java and .NET camps – just as, at one point during the mainframe era, the world was divided mostly between FORTRAN and COBOL.
Time to close the patent office?
Not so fast. For most organizations, building websites on Sun SPARC servers running Oracle on Solaris was overkill; they needed cheaper, simpler platforms. For developers, Java and .NET didn’t serve everybody’s needs, and purpose-built languages like JavaScript, Perl, and Python emerged. And thus came the LAMP stack. Making it possible was a new model of software development: open source. A skunkworks project led by Linus Torvalds grew more viral than he ever expected: a clean-room UNIX, simplified without all the bloat and engineered to run even on aging 386 machines. Reboot them once a year if you wanted to clear your conscience. Eric S. Raymond articulated the new model with his tome, The Cathedral and the Bazaar.
Beyond being “free” software (nothing is truly free), a key premise of open source was actually a more robust development and maintenance model. Instead of relying on a single vendor, development and maintenance occur out in the wild. The rationale behind the community model was that, if an open source project gained critical mass, you would have the world’s largest virtual development team managing and maintaining it. New features could get added more readily, but more importantly, with the full reach of the community, bugs could get addressed more quickly. Well, maybe not always.
So how did the Linux open source experiment go? A quarter century later (actually, last month to be exact), a Linux Journal conversation between Bob Young, the cofounder of Red Hat, and Torvalds revealed that world domination was no longer such a joke.
Young and Torvalds may have been speaking about Linux, but when it comes to world domination, we’re also talking about open source. We’ve seen the impact locally here in the New York area. As we noted a few weeks back in our ZDnet column, New York has always been a tech town, but before open source, it was all locked away; nobody spoke to each other. Today, there are tech meetups almost nightly, and Wall Street firms are increasingly adopting open source-first policies for core infrastructure so they don't have to reinvent it themselves. More importantly, they are encouraging their open source developers to go out and talk – both to keep them happy and as a subtle talent recruitment strategy.
For open source, the cycle is coming around again. Over the past year, a midlife crisis has set in: the companies behind popular open source projects like MongoDB, Redis, Confluent, MariaDB and others risk becoming victims of their own success. They are concerned about cloud providers – big and small – profiting off the skin that they put in the game. And so, many of them are coming up with weird licenses designed to balance open source with restrictions on third-party cloud providers – licenses that pose a potential quandary for enterprise legal departments. A few weeks back, we proposed yet one more look at a familiar model: open source companies should return to the open core approach.
Let’s look at another cycle that hits home for us: data platforms. Relational databases cleared the way for enterprise applications – no longer did each application have to include its own data store. They also freed developers, who could navigate to data logically with declarative languages like SQL instead of writing procedural code tied to where the data physically lived.
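To make that declarative point concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the orders table and its columns are hypothetical, invented purely for illustration.

```python
import sqlite3

# Hypothetical "orders" table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("acme", 120.0), ("globex", 75.5), ("acme", 42.0)],
)

# The query states *what* we want; the engine decides *how* to fetch it --
# no procedural navigation to where the rows physically live.
for customer, spend in conn.execute(
        "SELECT customer, SUM(total) FROM orders GROUP BY customer"):
    print(customer, spend)
```

The same SELECT would work unchanged whether the table lived in memory, on local disk, or on a bigger engine across a cluster – that indirection is what freed developers.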
But not all the world’s data fits neatly into the relational model. In the 1990s, there was a flirtation with object-oriented databases. Whether they could scale or not is a question for another day. Suffice it to say that by that point, SQL was so far along toward becoming the de facto enterprise standard that ODBMSs couldn’t make a dent in the huge SQL skills pool.
Yet in the 2000s, the database world began rethinking those assumptions. According to Eliot Horowitz, who went on to cofound MongoDB, lots of data was simply too rich to cram into columns and rows; documents seemed to be the shape that most data naturally takes. And because JSON is based on a construct already highly familiar to developers (thanks to the popularity of JavaScript), you could think of document databases as a poor man’s object databases, minus the overhead and complexity of inheritance and polymorphism.
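As a hypothetical sketch of what “documents as the natural fit” means in practice, here is a nested order expressed as JSON in plain Python; the field names are invented for illustration and not drawn from MongoDB or any particular product.

```python
import json

# Hypothetical order document: the nested structure that a relational design
# would normalize across orders, customers, and line-item tables lives
# together here as one document.
order = {
    "order_id": 1001,
    "customer": {"name": "Acme Corp", "city": "New York"},
    "line_items": [
        {"sku": "A-100", "qty": 2, "price": 19.99},
        {"sku": "B-205", "qty": 1, "price": 349.00},
    ],
}

# Round trip: the shape developers work with in code is the shape that a
# document database stores and returns.
doc = json.dumps(order)
print(json.loads(doc)["line_items"][0]["sku"])  # -> A-100
```

Reassembling the same information from normalized tables would take joins; the document keeps it in the form the application already uses.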
But what about scale? In the 2000s, Internet companies were dealing with insane torrents of data. Facebook’s MySQL data platform hit the wall: volumes grew so huge that the company could not back up the database within a 24-hour window. So much for the LAMP stack. And when it came to building a search index, Doug Cutting and Mike Cafarella (whose work later landed at Yahoo) had to search for a new massively parallel compute and storage technology. Research published by Google provided the hint on how to build for scale, which led companies like Yahoo, Facebook, and LinkedIn to develop open source implementations that resulted in platforms like Hadoop and Cassandra.
Remember back in the 1990s when all those SQL Server and Oracle databases back-ending websites were so swamped that we needed a middle tier to separate business logic from the data? Well, now the data volumes being analyzed were so huge that we had to collapse everything all over again – we had to bring compute back to the data. So much for Tomcat. But as you’ll see, this wouldn’t be the final chapter of the saga. Hold that thought again.
The new NoSQL and big data platforms that emerged in the 2000s were another case of the cycle repeating itself, but with the new twists of schemaless or schema-on-read designs and scale through replication and sharding. Hadoop also revived a practice that had more in common with legacy data platforms that intermixed business logic with data: it brought compute to the data, because the data was so massive that moving it across the network would introduce too much latency.
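To illustrate the sharding twist, here is a toy, in-memory Python sketch of hash-based key routing – the kind of partitioning that lets these platforms spread load across machines. It is purely illustrative; real systems layer on replication, rebalancing, and consistent hashing.

```python
import hashlib

# Toy sketch: route each key to one of N shards so reads and writes
# spread across machines (here, just N dictionaries).
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash of the key, modulo the shard count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
print(shard_for("user:42"), get("user:42"))
```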
Until network latency became less of an issue. The emergence of cloud-native architectures, which separate storage from compute to support elasticity, added yet another turn to the cycle with a new twist. Today, most (not all) cloud-native databases still run with attached storage, but outliers like Snowflake are separating compute from storage. Before cloud became big, engineered systems like Exadata built that separation into their appliances, albeit on a smaller scale than massive cloud data centers. Today, for cloud-based Hadoop and various analytics services, cloud-native architecture is increasingly becoming the norm. Amazon EMR, Azure HDInsight, and Google Cloud Dataproc typically run on cloud storage, although the customer can specify file systems (HDFS or Google’s current iteration of GFS) if performance and data locality are requirements. With services like Amazon Athena, Redshift Spectrum, Azure SQL Database Hyperscale, and Google Cloud BigQuery, queries can run directly against cloud storage (which we believe is becoming the de facto data lake).
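As a hedged sketch of what querying object storage directly looks like, here is roughly how a query runs through Amazon Athena via boto3; the database name, table, and S3 output bucket are hypothetical placeholders, and the snippet assumes AWS credentials are already configured.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database, table, and results bucket -- placeholders only.
run = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) FROM clickstream GROUP BY event_type",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query against the S3-resident data finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```

The point of the sketch is that there is no database server holding the data: the table is files in cloud storage, and compute is rented only for the duration of the query.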
And did we mention cloud? That’s another technology that’s come full circle. Amazon had some excess compute cycles that it wanted to monetize. We all know where that story went, but in a way, AWS EC2 was not a virgin concept. Remember timesharing? In the early days of mainframes, few organizations could afford their own computing systems, so they rented time from shared remote facilities. As the IBM 360 ushered in wider affordability, timesharing went the way of hula hoops. But when the Internet emerged as an alternative to expensive, dedicated private links, Application Service Providers (ASPs) arose to offer a new take on remote hosting, based on the notion that standard architectures (e.g., SAP hosting) would bring economies of scale. The ugly truth was that, even with the same product, organizations so customized their enterprise systems that the architectures were not so standard after all, and ASPs became another dot com casualty. It took Salesforce to move the needle with a standard, multi-tenant application that was only available online.
Back to databases. As the pendulum swung back to separating storage from compute, a couple of “new” concepts emerged to do the same for the application tier: containers and microservices. Containers adapted the virtual machine model popularized by VMware into a simpler, lighter-weight form: strip most of the OS out of the image, which eliminates significant complexity and overhead. As for microservices, anybody around during the early 2000s might recall service-oriented architecture (SOA), which also strove to abstract application logic from its physical implementation. Far less successful than VMs, SOA crashed under its own weight and complexity when standards committees couldn’t even get something with “Simple” in its first name (SOAP) to live up to its billing. As their name attests, microservices took that abstraction to a much more granular level, aided and abetted by emerging de facto standards at the container level, including Docker and Kubernetes.
Nonetheless, with cloud-native architectures, databases and applications can be completely rethought. Remember bringing compute to the data? Elastic architectures force a rethinking of all that.
If you’re willing to put up with the pain of transformation, those hairball databases and applications can be rethought, with the sheer scale and commodity infrastructure of the cloud redefining how replication, high availability, disaster recovery, ACID transactions, and compute are managed. Although most of the cloud providers are not (yet) ready to admit it, innovations that add intelligence to the storage layer can be shared across multiple database platforms or data models. Data silos could be eroded. And with applications containerized, and open source projects like Kubernetes becoming the de facto standard, the path is being built for enterprises to rethink how they run their core business systems and get new applications into production. Then apply some machine learning to the running and operation of the database, and you can refactor the daily lives of DBAs.
Until that last sentence, we hadn’t even bothered to mention the elephant in the room: AI and machine learning. That’s yet another case where old is new again. Thirty years ago, we recall coming back from an AI conference in Seattle and writing a post with the headline, “AI’s out, expert systems are in.” That being 1987, there’s no online record of our news filing. Dare us, and we’ll photocopy the page and add a link. What’s funny is that today the headline would likely be reversed, as expert systems proved just too rigid and rules-based; machine learning provides a far more adaptable approach.
AI came out of its winter, not necessarily because the algorithms got better, but because the equivalent of Moore’s Law for storage, networks, and compute made AI (or at least, for now, machine learning) practical. AI models are ravenous for data, and fortuitously, there’s enough of it around, the network pipes are fat enough to get the data to the compute, and the compute itself has gotten far less expensive and more powerful. One big advance is that now we know how to scale linearly, something that was only a dream as recently as the early 2000s. The potential for AI is only limited by the imagination.
While it might seem that repeats of history would get boring, the new twists are what have made this journey a wild ride. We thought the database world had reached its end state back in 2000, but today, best practices for harnessing the cloud to upend how we manage, transact, and analyze data are only just emerging. Enterprises are grappling with how to manage and govern that data, and platform and solution providers are only just scratching the surface of how they can tap the scale and distributed reach of the cloud to evolve their offerings.
So, if you’re wondering why we’re starting a new business to explore how the cloud and AI will make database management and analytics different, this long-winded tour through IT history hopefully provides the answer. Having double-majored in history and journalism, we’ve naturally been documenting this journey: our original onStrategies Perspectives blog provided 15+ years of roving commentary dating back to 2000-2001, and since 2016, we’ve been posting our insights on ZDnet.
It’s been a wild ride. At dbInsight, we’re hoping you’ll join us to learn what the next cycle of data-driven innovation is going to throw at us.