New Data Cloud Alliance promises to make data more accessible to AI.
The Challenge of Gaining Access to Data
One of the biggest challenges in digital transformation and the implementation of artificial intelligence (AI) systems is access to data. The hope is that AI-influenced decisions will help engineers and scientists solve the big challenges facing the world—like climate change, overpopulation, renewable energy and cybersecurity. But to make good decisions and train these systems, AI needs data.
To that end, Google Cloud recently announced its Data Cloud Alliance with 11 other data companies as “a new initiative to ensure that global businesses have more seamless access to and insights into the data required for digital transformation.” The group also includes:
- Accenture
- Confluent
- Databricks
- Dataiku
- Deloitte
- Elastic
- Fivetran
- MongoDB
- Neo4j
- Redis
- Starburst
The simple goal of the group is to improve access to, and insights into, data. The group’s core principles revolve around accelerating adoption through common practices, reducing security challenges and closing the current skills gap. Alliance members are committed to building application programming interfaces (APIs), supporting customer integration and building infrastructure for artificial intelligence.
According to IDC, the AI market is expected to grow 19.6 percent in 2022 to a valuation of $432.8 billion, and it is on track to break the $500 billion mark in 2023. Clearly, while there are huge opportunities for engineers to harness AI to help make the world work faster and more efficiently, we need the data to get started.
Standardization, Artificial Intelligence and Data Collection
The field of artificial intelligence is trying to govern its relationship to data collection even as the field itself takes shape. Leaders in the field are working to ensure that data is accessed and used legally and without violating privacy, which is a huge concern for both customers and business owners.
The existing standardization landscape is a mix of people calling for rules, organizations creating their own rules and companies moving forward with internal ones. And all of this development is happening after the AI train has already left the station.
In 2019, the National Institute of Standards and Technology (NIST) published “U.S. Leadership in AI: A Plan for Federal Engagement in Developing Technical Standards and Related Tools,” which included a six-page laundry list of AI standards—on cybersecurity, privacy and data management—that were either published or in development. Those standards are owned by professional societies, standards organizations, international telecommunication groups and the Consumer Technology Association. Since 2017, the International Organization for Standardization’s (ISO’s) Artificial Intelligence Committee alone has published 11 ISO standards, with 26 more in development. It’s clear that these guidelines need to be consolidated into a single workflow so that data can be shared safely and ethically.
This is one of the issues that Google and its partners in the Data Cloud Alliance can help address by throwing their weight behind one of these standards and streamlining the process of pulling in data from ethical and legal sources.
How Are Google’s Competitors Addressing the Challenge?
The glaring omissions from the Data Cloud Alliance are Google’s competitors in the field. If there is going to be a single standard, there will need to be some agreement among the competition. So, how are other cloud companies addressing the problem of supplying data for AI purposes, and how does that compare to what Google currently offers?
Amazon Web Services (AWS) already has a large AI presence, and it offers two huge data resources for engineers working on AI projects. First, the Registry of Open Data lets people find existing datasets and share their work with others working in the same industries—or trying to solve the same problems. Interested parties can upload their data to the AWS repository, where the information can be accessed globally without charge.
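To get a sense of how low that barrier is, here is a minimal sketch of reading one of the registry’s public datasets with boto3. The bucket name (NOAA’s GHCN archive) is just one example pulled from the registry; substitute whichever dataset you actually need, since access patterns vary by dataset.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Open datasets in the Registry of Open Data are typically hosted in public
# S3 buckets, so an unsigned (anonymous) client is enough to read them.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# "noaa-ghcn-pds" is one bucket listed in the registry; substitute the bucket
# named on the registry page of the dataset you are interested in.
response = s3.list_objects_v2(Bucket="noaa-ghcn-pds", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```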
Then there is the AWS Data Exchange, where engineers can subscribe to data feeds or access datasets on a single-use basis. This is more of a marketplace where data can be supplied and consumed.
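Once a subscription is in place, entitled datasets show up through the Data Exchange API. A minimal sketch with boto3, assuming default AWS credentials and a region where the service is available, might look like this:

```python
import boto3

# Data Exchange is a regional service; us-east-1 is one region where it runs.
dx = boto3.client("dataexchange", region_name="us-east-1")

# List the data sets this account is entitled to through its subscriptions.
paginator = dx.get_paginator("list_data_sets")
for page in paginator.paginate(Origin="ENTITLED"):
    for data_set in page["DataSets"]:
        print(data_set["Name"], data_set["Id"])
```

From there, revisions and assets within a data set can be exported to the engineer’s own S3 bucket for use in a project.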
Azure also has its own AI branch, with a vast array of products and tools to manage data. For instance, Microsoft Purview is a data governance tool built to help companies manage data at the organizational level. Reports can be generated to show exactly what data is available to a company across clouds, or on different platforms, and how that data is being used. Classifying datasets and tagging them with keywords in the system helps engineers get access to the relevant information they need for a project.
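A rough sketch of that keyword search against a Purview catalog is shown below, using an Azure AD token and the service’s REST endpoint. The account name is hypothetical, and the exact path and api-version are assumptions that should be checked against the current Purview REST reference.

```python
import requests
from azure.identity import DefaultAzureCredential

# Hypothetical Purview account; replace with your own.
ACCOUNT = "example-purview-account"
SEARCH_URL = f"https://{ACCOUNT}.purview.azure.com/catalog/api/search/query"

# Acquire an Azure AD token scoped to the Purview data plane.
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

# Keyword search across classified assets in the catalog (path and
# api-version are assumptions; verify against the current REST docs).
response = requests.post(
    SEARCH_URL,
    params={"api-version": "2022-03-01-preview"},
    headers={"Authorization": f"Bearer {token}"},
    json={"keywords": "customer orders", "limit": 10},
)
response.raise_for_status()
for asset in response.json().get("value", []):
    print(asset.get("name"), asset.get("qualifiedName"))
```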
Azure also has its own method for data producers to share data—the aptly named Azure Data Share. Data Share operates less like a marketplace and more like an interface that lets data owners send information to users outside the organization. Engineers can ship data back and forth while collaborating on a project, but there is no mechanism for searching for data that might be needed for an application.
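For completeness, here is a rough sketch of that owner-to-recipient flow using the azure-mgmt-datashare management SDK. The subscription, resource group, account and share names are hypothetical, and the operation and property names reflect my reading of that SDK, so verify them against the current reference before relying on this.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datashare import DataShareManagementClient

# Hypothetical subscription and resource names.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "example-rg"
ACCOUNT = "example-datashare-account"

client = DataShareManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create a share inside an existing Data Share account.
share = client.shares.create(
    RESOURCE_GROUP, ACCOUNT, "engineering-share",
    {"share_kind": "CopyBased", "description": "Datasets for an outside design partner"},
)

# Invite a recipient outside the organization to subscribe to the share.
invitation = client.invitations.create(
    RESOURCE_GROUP, ACCOUNT, "engineering-share", "partner-invite",
    {"target_email": "designer@partner.example.com"},
)
print(share.name, invitation.name)
```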
Giving data owners a place to supply data without worrying about licensing or infrastructure feels like a big boost, and I’m assuming one or all of these solutions will be similar to the infrastructure that the Data Cloud Alliance hopes to create. Perhaps, by spearheading this alliance, Google is hoping that the group will standardize on its own data sharing tool—Datashare—which is similar to the ones offered by AWS and Azure?
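As a point of reference for what consuming shared data on Google Cloud already looks like, here is a minimal sketch that queries one of BigQuery’s public datasets with the google-cloud-bigquery client. It illustrates shared-data consumption in general rather than the Datashare product itself, and it assumes application-default credentials and a billing project are configured.

```python
from google.cloud import bigquery

# Uses application-default credentials and the default project for billing.
client = bigquery.Client()

# bigquery-public-data hosts datasets Google shares with all users;
# usa_names is one of the commonly cited examples.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.name, row.total)
```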
What Does It All Mean?
A key consideration for any adoption of AI is the challenge of collecting data. One of the problems many of us face as engineers is that we focus on the project at hand, pushing the business side of things, like collaborations, to the background.
The Data Cloud Alliance, in contrast, is a group of businesses. Their action-oriented thinking is why the alliance homepage has a link for interested parties to join, as well as links for customers to get started on Google Cloud. These competing constraints of technological progress, social progress and making money are always present as we work toward an AI future.
On some levels, this reminds me of sending CAD files to other companies in the 1990s. How did we know that the file was going where we wanted it to go with privacy and security intact? As a project engineer, it was only important for me to get the data to my designer as fast as possible. I wasn’t worried about the security concerns in the process but was more focused on checking that the revision for a specific part number matched what the customer or vendor had sent. Now in 2022, many companies have resources available for designers to download CAD models directly from the product page in a catalog. Making this data freely available might not create new customers just through the act of downloading an IGES file, but it saves time and gains goodwill with current and potential customers.
Google, AWS and Azure might all gain goodwill simply because exchanging AI data lets so many of their customers pull from shared sources without wading through additional processes or approvals.
Will Google Cloud work with AWS and Azure to share their data pools and push forward the potential of artificial intelligence? Or will some other organization decide to take the freely available datasets from these big players and create a conglomerate? The one thing that’s clear in the fluid field of artificial intelligence is that engineers are going to need data. And that data should be easily accessible, secure and ethically produced. Hopefully, the Data Cloud Alliance is one of the first steps toward building our data future with new tools, infrastructure and partnerships.