Moving to the supercloud means rethinking data management
In becoming the largest privately held freight transportation company in North America, Estes Express Lines Inc. created a lot of data silos. Following years of acquisitions, diversification and geographic expansion, by 2021 the data fragmentation picture across the company was “pretty bad,” said Bob Cournoyer, senior director of data strategy, business intelligence and analytics.
Core financial and operational data were kept on an IBM Corp. AS/400, but up to half of critical data was spread across multiple cloud systems. “We also had a ton of internal SQL Server databases,” Cournoyer said. “Our challenge was to figure out how to tie it all together.”
Over the past two years, Estes Express has discovered, mapped and indexed about 90% of its core data using a data virtualization platform from Denodo Technologies Inc. The freight carrier also overhauled its approach to data management, reorganizing the information technology department into product teams aligned to business functions and parking data with product owners.
Denodo’s GraphQL service enables users to run queries against the virtual data model without requiring copies or extracts, dramatically shortening a process that used to “take forever because people would spend a lot of time looking for data,” Cournoyer said.
Data virtualization and a metadata catalog now provide a base view of all indexed data that “looks exactly like it does in the source system: same names, same data types, everything,” he said. “If I want to query against that, all I have to do is execute the view.”
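The idea of a view that spans source systems without copying their data can be sketched in a few lines of Python. This is a minimal illustration using SQLite from the standard library, not a depiction of Denodo's platform or Estes' systems; the database names, tables and values are all hypothetical.

```python
import sqlite3

# Two attached in-memory databases stand in for isolated source systems.
# All names and values here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS finance")
conn.execute("ATTACH DATABASE ':memory:' AS ops")

conn.execute("CREATE TABLE finance.invoices (shipment_id INTEGER, amount REAL)")
conn.execute(
    "CREATE TABLE ops.shipments (shipment_id INTEGER, origin TEXT, destination TEXT)"
)
conn.executemany(
    "INSERT INTO finance.invoices VALUES (?, ?)", [(1, 250.0), (2, 410.5)]
)
conn.executemany(
    "INSERT INTO ops.shipments VALUES (?, ?, ?)",
    [(1, "Richmond", "Atlanta"), (2, "Dallas", "Denver")],
)

# A TEMP view may reference tables in any attached database. It stores
# only the query definition, so no rows are copied or extracted.
conn.execute(
    """
    CREATE TEMP VIEW shipment_revenue AS
    SELECT s.shipment_id, s.origin, s.destination, i.amount
    FROM ops.shipments AS s
    JOIN finance.invoices AS i ON i.shipment_id = s.shipment_id
    """
)

QUERY = "SELECT * FROM shipment_revenue ORDER BY shipment_id"
rows = conn.execute(QUERY).fetchall()

# New rows in the sources appear in the view immediately because the
# view is evaluated against the source tables at query time.
conn.execute("INSERT INTO ops.shipments VALUES (3, 'Chicago', 'Memphis')")
conn.execute("INSERT INTO finance.invoices VALUES (3, 99.0)")
rows_after = conn.execute(QUERY).fetchall()

print(rows)
print(len(rows_after))
```

Querying the combined data is then a matter of selecting from the view, which is the behavior Cournoyer describes, though a commercial virtualization layer also handles remote sources, security and query optimization that this sketch ignores.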
As a result, data engineers spend much less time writing extract/transform/load or ETL code and moving data around. The size of the data management team has been reduced by more than half. “They now spend most of their time serving up the data,” Cournoyer said. “We’re morphing into a data service bureau as opposed to the old way that we did it. This platform offered an opportunity to give everyone a window into all our data assets.”
Estes’ computing environment isn’t technically a “supercloud,” but the company has put the data foundation in place to take advantage of the hybrid multicloud environments that are rapidly defining the new enterprise computing landscape.
What’s in a name?
The supercloud, which is also variously called the “meta cloud,” “cross-cloud” and “poly cloud,” refers to a computing architecture in which resources from multiple public and private cloud platforms are blended in a way that is nearly invisible to the customer, usually through the use of an abstraction layer and distributed data management. The goal is “to deliver additional value above and beyond what’s available from public cloud providers,” says David Vellante, chief analyst at Wikibon, SiliconANGLE’s sister market research firm.
The abstraction layer allows developers to stitch together best-of-breed services in a single application while accelerating the development process because “they don’t have to think about all those other capabilities, where they come from or how they’re provided. Those are already present,” Jack Greenfield, vice president of enterprise architecture and global technology platform chief architect at retail giant Walmart Corp., said in an interview for theCUBE’s Supercloud2 event that kicks off Tuesday, Jan. 17.
By all indications, the time is right for the supercloud. Cisco Systems Inc.’s 2022 Global Hybrid Cloud Trends survey found that 82% of IT leaders have adopted a hybrid cloud while just 8% use only a single public cloud platform. Flexera Software LLC reported last year that 92% of enterprises have a multicloud strategy.
Experts say building a supercloud is about more than unifying developer toolsets and administrative controls. The process requires organizations to overhaul their approach to data management around a consistent set of ownership and governance principles anchored in a unified view of data.
That’s a big shift for a lot of companies because data has historically been tied to applications. That has resulted in the formation of disaggregated islands of information that make data hard to find and combine for business insights. Years of acquisitions and point projects have further scattered data to the winds. Software-as-a-service has contributed to the chaos by spreading information across a patchwork of special-purpose clouds.
Data warehouses have long been used to unify the data needed for strategic decision-making, but the technology and labor required to maintain them make their costs sky-high. “If you needed to find out what was going on in the business, you needed to extract the data and move it into a data warehouse,” said George Gilbert, principal at data consultancy TechAlpha Partners. “Everything is now data-centric instead of application-centric.”
Killing the ‘frankencloud’
A distributed cloud architecture without sufficient controls on where and how data is used creates what Hillery Hunter calls “frankenclouds.” “You have islands of consumption that are managed differently and that can open you up to cyberattacks,” said Hunter, who is chief technology officer for IBM Cloud. “A lot of customers are saying that random acts of cloud usage are also incurring unnecessary expense. They want to break down the frankencloud.”
Creating a unified view of data isn’t simple. It can require organizations to dig through hundreds or thousands of isolated repositories, apply metadata tags and catalog their data assets to make them universally discoverable. And in an environment in which responsibility for data quality and maintenance is increasingly moving closer to the people who own the data, creating a central resource is impractical.
“We need to embrace the fact that having all the data in one place for governance is not a feasible approach,” said Adit Madan, director of product management at distributed file system developer Alluxio Inc.
“In the ’90s it was about bringing everything together into one central solution. Today there’s no way we can support that,” agreed John Spens, head of data and artificial intelligence for North America at IT consultancy Thoughtworks Inc. “While policies may need to be centralized, the ownership has to be distributed.”
That’s the approach eBay Inc. took during a five-year-long modernization of its computing architecture around edge computing and distributed services. Platform services were centralized to provide consistent data lifecycle management processes and metadata, but product teams manage their own data based on a standard set of tools and processes. The result “is a shared responsibility model that pushes ownership of data to lines of business but leverages centralized services to manage eBay’s data and related metadata,” said a spokesperson for the e-commerce firm.
Companies that have started down the supercloud trail say a base set of tools and practices is essential to laying the groundwork for an integrated data fabric. A key step is to adopt a model-driven approach to application services that deals with business entities rather than data elements. The company’s data assets should then be mapped to those entities.
Gilbert cites the example of ride-hailing firm Uber Technologies Inc., whose logistics software is based on entities such as drivers, riders and prices instead of rows and columns. “Those are activities that are done autonomously and don’t require a human to type something into a form,” he said in an interview. “The application is using changes in data to calculate an analytic product and then to operationalize it, assign a driver and calculate a price.” The concept is similar to “digital twins,” which are virtual representations of physical entities that can be used for design and modeling and can be manipulated at a high level rather than with SQL queries.
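A rough sketch of what working with entities rather than rows and columns might look like follows. It is purely illustrative: the classes, the one-dimensional positions and the fare constants are assumptions made for this example and do not reflect Uber's actual software.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical business entities; nothing here reflects Uber's real model.


@dataclass
class Driver:
    driver_id: str
    position_km: float  # one-dimensional location, for simplicity
    available: bool = True


@dataclass
class RideRequest:
    rider_id: str
    pickup_km: float
    dropoff_km: float


@dataclass
class Assignment:
    rider_id: str
    driver_id: str
    price: float


BASE_FARE = 2.0  # assumed flat fee, in dollars
PER_KM = 1.5  # assumed charge per kilometer, in dollars


def handle_request(
    request: RideRequest, drivers: List[Driver]
) -> Optional[Assignment]:
    """React to a new request: pick the nearest free driver and price the trip."""
    candidates = [d for d in drivers if d.available]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda d: abs(d.position_km - request.pickup_km))
    nearest.available = False  # the change in data is acted on automatically
    distance = abs(request.dropoff_km - request.pickup_km)
    price = round(BASE_FARE + PER_KM * distance, 2)
    return Assignment(request.rider_id, nearest.driver_id, price)


drivers = [Driver("d1", 0.0), Driver("d2", 10.0)]
result = handle_request(RideRequest("r1", pickup_km=8.0, dropoff_km=12.0), drivers)
print(result)
```

The application code never mentions tables or columns. It reasons about drivers, riders and prices, and the mapping from those entities to stored data would live in a separate layer.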
Model-driven development “is a fairly major shift in the way we think about writing applications, which is today a code-first approach,” Bob Muglia, former CEO of Snowflake Inc., told SiliconANGLE in an interview. “In the next 10 years, we’re going to move to a world where organizations are defining models first of their data, but then ultimately of their entire business process.”
That will require semantic frameworks that translate row and column addresses into business terms to make data more accessible. “We used to have people who were technically literate, data-literate or computer-literate,” said Andrew Mott, data mesh practice lead at Starburst Data Inc., which sells a commercial version of the Trino distributed query engine. “In the future, we’ll need people who understand the business and the data that supports it.”
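In its simplest form, such a framework is a lookup from business vocabulary to physical storage. The sketch below assumes a made-up table and made-up column names; it is not drawn from Starburst, Trino or any company quoted here, and a real semantic layer would add access control, validation and much more.

```python
# Hypothetical semantic model: business terms mapped to physical columns.
SEMANTIC_MODEL = {
    "shipment": {
        "source": "ops.SHPMT_HDR",
        "fields": {
            "shipment number": "SHP_NO",
            "origin terminal": "ORG_TRM_CD",
            "delivery date": "DLV_DT",
        },
    }
}


def to_sql(entity: str, terms: list) -> str:
    """Translate a request phrased in business terms into a SQL statement."""
    spec = SEMANTIC_MODEL[entity]
    unknown = [t for t in terms if t not in spec["fields"]]
    if unknown:
        raise KeyError(f"No mapping for business terms: {unknown}")
    columns = [f'{spec["fields"][t]} AS "{t}"' for t in terms]
    return f'SELECT {", ".join(columns)} FROM {spec["source"]}'


sql = to_sql("shipment", ["shipment number", "delivery date"])
print(sql)
```

A business user asks for a shipment number and a delivery date, and the cryptic column codes stay hidden behind the mapping.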
Printer and imaging equipment maker Lexmark International Inc. is preparing for just such a day. The company has embraced software containers and infrastructure-as-code to simplify software development and is consolidating analytical data into a flexible and low-cost “lakehouse” architecture. “We are absolutely embracing hybrid multicloud,” said Chief Information Officer Vishal Gupta.
Lexmark is seeking to eliminate data silos through a three-pronged strategy. It created a data steering team composed of senior leaders from multiple groups that promotes collaboration on data and consistent procedures for managing it.
Through a partnership with North Carolina State University, it has grown its pool of data science talent tenfold to 50 people over the last few years, with nearly half distributed to business units. It’s also expanding its existing analytics expertise into machine learning with the goal of harnessing the value of its more than 1.5 million sensors.
Lexmark created an end-to-end data architecture around metadata discovery and a master data catalog. It has built more than 500 connectors to integrate data from around the business and even filed patents on some of its data integration inventions.
“We want to have a centralized architecture but distributed access,” Gupta said. “We want people from any group to be able to use data but with a standardized platform.”
Sophisticated metadata management is crucial to reaching that goal, experts say. Although there is no right or wrong approach to centralizing data, having a consistent nomenclature for defining the data you have is essential to making it useful across multiple cloud applications. “There is more metadata required because more people need to discover data,” said Starburst’s Mott.
At Estes Express, data discovery used to be a guessing game, Cournoyer said. “A lot of the old stuff wasn’t even documented,” he said. “We had tribal knowledge and business analysts who kind of knew where things were, but if you wanted to solve a problem, you’d have to put in a request and get an analyst assigned to you. Today you can go into Denodo and search for what you need.”
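The difference between tribal knowledge and a searchable catalog can be shown with a small sketch. The entries, owners and tags below are invented for illustration, and the keyword match is a stand-in for the far richer search a product such as Denodo's would provide.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    source: str
    owner: str
    description: str
    tags: FrozenSet[str] = frozenset()


def search(catalog: List[CatalogEntry], keyword: str) -> List[CatalogEntry]:
    """Return entries whose name, description or tags contain the keyword."""
    needle = keyword.lower()
    return [
        entry
        for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


# Hypothetical catalog entries with consistent naming and tagging.
catalog = [
    CatalogEntry("invoices", "finance_db", "Finance", "Billed amounts per shipment",
                 frozenset({"billing", "revenue"})),
    CatalogEntry("shipments", "ops_db", "Operations", "Origin and destination of freight",
                 frozenset({"logistics"})),
    CatalogEntry("customers", "crm_cloud", "Sales", "Accounts that receive an invoice",
                 frozenset({"accounts"})),
]

matches = [entry.name for entry in search(catalog, "invoice")]
print(matches)
```

Each entry records where the data lives and who owns it, so a search replaces the request to an analyst who "kind of knew where things were."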
Building an overarching view of data is only possible if everyone cooperates in the process. However, “many times people are afraid to give up data because data is power,” said Lexmark’s Gupta.
The District of Columbia Water and Sewer Authority is using business intelligence and machine learning as strategic tools to optimize its reliability, resiliency and sustainability initiatives while continuously realizing efficiencies and reducing costs in its cloud-first strategy. The authority uses its vast “internet of things” and operational data-gathering capabilities to feed its machine learning-based analytical applications.
DC Water uses Microsoft Corp.’s PowerBI data visualization software and Azure Machine Learning services to provide analytical capabilities to staff without requiring formal data science training. “We’re supporting and encouraging teams to upgrade from complex Excel models and develop ML and PowerBI-based capabilities,” said Aref Erfani, senior program manager for data and analytics. “We give them the playground and all the necessary tools within appropriate security guidelines.”
Training, demonstrations and prototypes promote the more robust capabilities of machine learning and business intelligence software compared to spreadsheets as well as the value of sharing information. That, in turn, reinforces the importance of data quality.
“Once you teach the gospel of ML and related analytics, the hunger for more data grows and it gets people to realize that you need to ensure data quality and consistency across the enterprise,” Erfani said. “Data is a corporate asset now and we all need disciplined asset management.”
DC Water has a huge opportunity to improve decisions as a data-driven enterprise “but that will only happen if data is easy to access and understandable within various and appropriate contexts,” Erfani said, “not just reams of information with no apparent purpose.”
A question of money
For smaller organizations in particular, the question of whether to invest in a data management overhaul for the supercloud may come down to the more prosaic question of cost. The tooling to manage data across multiple clouds is still nascent, a fact that prompted Lexmark and Walmart to develop their own intellectual property. Such an option isn’t typically available to small firms.
“The technology is not there,” said Sanjeev Mohan, a data analytics consultant and a former Gartner Inc. research vice president. “There isn’t as yet a tool that lets you say you want to use Databricks for machine learning and then you can point and click and that workload runs on Databricks.”
There are also questions of complexity, cost and security in moving data between cloud platforms. “The cyberattack surface of a supercloud is greater because of the differences between environments. The protection of data in transit and in use are critical,” said IBM’s Hunter, adding, ominously, “Be sure you’re working with a provider that can be trusted with your data. This has not been a consistent practice across the industry.”
Cloud providers don’t make it easy for customers to transfer data between them. “Therefore, there is no immediate synchronization of data from one cloud platform to another,” said Cameron Davie, principal solutions engineer at data integration provider Talend Inc.
That creates the risk that customers will copy the same data to multiple clouds, thereby creating redundancy and the host of problems that such a scenario invites. “Data could and should be distributed across multiple cloud providers and multiple regions but not necessarily made as full copies,” Davie said.
And then there are egress fees, which are the controversial charges cloud providers levy on customers to access their own data. Those up-charges can add up in data-intensive use cases such as machine learning. “We have customers who are replicating a hundred petabytes of data,” said Alluxio’s Madan. “At that scale, there is no end; the costs keep on accumulating.”
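A back-of-the-envelope calculation shows how quickly such charges scale. The per-gigabyte rate below is a placeholder assumption chosen for illustration, not any provider's published price, and the calculation ignores tiered pricing and negotiated discounts.

```python
from decimal import Decimal

GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def egress_cost(petabytes: int, rate_per_gb: Decimal, transfers: int = 1) -> Decimal:
    """Return the total egress charge in dollars for repeated full transfers."""
    return Decimal(petabytes) * GB_PER_PB * rate_per_gb * transfers


# Hypothetical flat rate in dollars per GB; real rates vary by provider and volume.
ASSUMED_RATE = Decimal("0.05")

one_time = egress_cost(100, ASSUMED_RATE)
yearly = egress_cost(100, ASSUMED_RATE, transfers=12)

print(f"One 100 PB transfer: ${one_time:,.2f}")
print(f"Twelve 100 PB transfers: ${yearly:,.2f}")
```

Under that assumed rate, a single 100-petabyte replication would run to $5 million, and repeating it monthly would multiply the bill by twelve.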
That may be a deal-killer for small companies, at least in the short term. Corporate giants can negotiate lower egress fees or use inexpensive direct connections for large data transfers, but that luxury is not available to everyone.
“In my opinion the supercloud will only be something that very large enterprises are going to want to do,” Mohan said. “They will bake in the egress costs. For small enterprises, egress will be an issue and supercloud will bring a level of complexity they are not ready for.”
Which doesn’t mean they shouldn’t start thinking ahead. After all, falling costs are one of the few things one can count on in this industry.