Improving the Engineering Utility of Data Lakes

Oracle tool aims to make data lakes compute ready.

Last month, Oracle announced the Autonomous Data Warehouse solution, which is designed to be a database packaged with data analytics and machine learning to improve the utility of data lakes. The solution’s other attractive features include its cloud-native, multicloud management and its inexpensive, optimized data storage. Oracle hopes these features will make the solution distinct from more traditional closed and siloed data options—especially for engineers.

(Image courtesy of Oracle.)


In conversation with George Lumpkin, vice president of Product Management for Oracle Data Warehouse and Autonomous Database Technologies, engineering.com discussed the new tool and where the company is headed with this suite of solutions.

To begin, Lumpkin emphasized the high scalability and elasticity of the solution. The aim is to provide a single cloud service that brings together all the pieces a large enterprise needs to make the best use of its data, with any type of data integrated and accessible in one platform. Oracle also has a major focus on utility: it doesn't want to simply build a database; it wants the data to be as useful and accessible as possible to anyone, regardless of their level of technical expertise. Throughout the conversation, Lumpkin emphasized that Oracle wanted to listen and respond to the needs of its customers in creating the new features of the Autonomous Data Warehouse.

New Features for Oracle Autonomous Data Warehouse

  1. Expanding the Multicloud

    Data is everywhere. The ongoing problem across industries is not data collection but how to make data useful. The Autonomous Database solution aims to simplify multicloud data warehousing through secure access to data storage in Amazon Web Services, Azure, and Google Cloud, and it can connect directly with databases in Azure SQL, Azure Synapse, Amazon Redshift, Apache Hive and more. Through Oracle SQL, simple APIs can access data from any of these sources to feed into advanced data analytics. Essentially, the Autonomous Database acts as a single cloud-native data repository that connects to all public clouds and offers flexibility in choice, a feature Oracle hopes will help its solution stand out from the crowd.

  2. Rethinking the Data Lake

    With so much data being generated, lots of it ends up destroyed or discarded to save money. But what if that data ultimately ends up being useful? The advent of data lakes changed how many companies view data storage and utility. A data lake acts as a low-cost storage strategy to store and use data that might otherwise be discarded. However, one issue with open and interoperable data lakes is that they are not optimized for any compute tasks. So even though SQL can be used in traditional data lakes, their architecture is built around lowering storage costs and is ultimately not optimized for SQL performance.

    Lumpkin emphasized that the Autonomous Data Warehouse aims to keep storage costs low without sacrificing performance. Following a recent 75 percent reduction in storage costs, all data required for SQL-related analytics can be stored in the Autonomous Data Warehouse itself. The highly optimized tool facilitates inexpensive object storage while improving query performance. With this solution, it seems Oracle wants to stop companies from splitting up their data into different storage products to save on costs.

  3. Straightforward Data Access and Analysis with Data Studio

    The Data Studio solution is a built-in, no-code user interface that allows users at any level of technical experience to develop solutions and analytics pipelines without relying on complex IT assistance. The solution supports over 100 data sources across a diverse array of data types, and new features allow data wrangling to any specific format or requirement. The Data Studio includes plug-ins for Microsoft Excel and Google Sheets to facilitate a direct connection between the Autonomous Data Warehouse and common spreadsheets to ensure data consistency and improve communication. By focusing on making the interface as user friendly as possible, the Data Studio suite makes it easy to access and launch analysis, regardless of a user's technical expertise. With built-in data transformers, the solution aims to be as simple and straightforward as possible.

  4. Easy and Open Data Sharing

    When communicating with nontechnical stakeholders, traditional data sharing is clunky and insecure. In the past, many people simply exported data to a comma-separated values (CSV) file and shared it via an email service. Despite seeming straightforward, this type of data sharing is not secure and cannot be continuously updated as new data is acquired or analyzed.

    In contrast, the open data sharing solution within the Autonomous Data Warehouse facilitates secure sharing both within and outside of an enterprise. The process involves creating a "data share" instance that can include any number of datasets or tables from any source. The owner of the data share decides what goes into the share and who can read it. The recipient then remotely accesses that centralized data instance, ensuring the data remains secure, appropriately governed and continuously up to date. At any time, the owner can revoke access to the data share for any or all of the recipients. Plus, the owner can see when the data is accessed and how frequently it is viewed.

    Overall, the data sharing solution aims to be scalable while optimizing bandwidth, security and access. The solution also supports the open standard Delta Sharing protocol as both a data provider and a data recipient.
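The data share workflow described in the list above (an owner chooses the tables and the recipients, recipients read the live data rather than a copy, and access can be revoked and audited) can be sketched as a toy model. This is an illustration only and not Oracle's API or the Delta Sharing protocol: the `DataShare` class, its method names, and the sample table and recipient names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataShare:
    """Toy in-memory model of a data share (hypothetical, not Oracle's API)."""
    owner: str
    tables: dict = field(default_factory=dict)       # table name -> list of rows
    recipients: set = field(default_factory=set)     # who may currently read
    access_log: list = field(default_factory=list)   # (recipient, table) per read

    def grant(self, recipient: str) -> None:
        self.recipients.add(recipient)

    def revoke(self, recipient: str) -> None:
        self.recipients.discard(recipient)

    def read(self, recipient: str, table: str) -> list:
        if recipient not in self.recipients:
            raise PermissionError(f"{recipient} has no access to this share")
        self.access_log.append((recipient, table))
        # Return the rows as they are now, so readers always see current data.
        return list(self.tables[table])

    def access_count(self, recipient: str) -> int:
        return sum(1 for who, _ in self.access_log if who == recipient)


# Hypothetical usage: share a defects table with one supplier.
share = DataShare(owner="analytics-team")
share.tables["defects"] = [("lot-1", 3)]
share.grant("supplier")

first = share.read("supplier", "defects")
share.tables["defects"].append(("lot-2", 1))   # owner adds new data
second = share.read("supplier", "defects")     # recipient sees the update

share.revoke("supplier")                       # further reads now fail
```

Unlike an emailed CSV, the recipient never holds a detached copy: each read goes back to the central share, which is what allows updates, revocation and usage tracking in this model.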

Supporting the Multicloud and Beyond

One of the most enticing offerings of the new solution is high-performance storage at the same cost as object storage. This means that your data lake becomes an optimized database, facilitating inexpensive storage while still allowing for rapid data analytics. Overall, the data lake becomes less of a "lake" and more of an intentional element of the overall IT architecture. All data, regardless of its obvious utility, can then be stored in an accessible manner optimized for analytics and computation. This can also better facilitate experimentation and data manipulation within the warehouse.

Many of these solutions can be used for product design and management, allowing companies to generate, store, analyze and utilize all the data collected during manufacturing, testing and shipment. With tools like the Autonomous Data Warehouse, engineers can potentially make better use of data from disparate locations: across public clouds and nearly any data-generating source. With all the data in one cloud-native managed database, anyone, regardless of technical expertise, can begin to use analytics to make informed decisions at all levels of management, whether that is deciding which prototype is performing best or identifying bottlenecks within a supply chain network.
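As a rough illustration of what querying data from separate sources through one SQL interface looks like, here is a minimal sketch. It uses Python's built-in SQLite module purely as a stand-in, with two attached in-memory databases playing the role of separate sources; it is not Oracle SQL or any Oracle API, and the database names, table names and sample values are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in here; this is not Oracle's engine or API.
conn = sqlite3.connect(":memory:")

# Each attached database plays the role of a separate source (hypothetical names).
conn.execute("ATTACH DATABASE ':memory:' AS testing")
conn.execute("ATTACH DATABASE ':memory:' AS shipping")

conn.execute("CREATE TABLE testing.results (prototype TEXT, pass_rate REAL)")
conn.execute("CREATE TABLE shipping.orders (prototype TEXT, days_in_transit INTEGER)")

conn.executemany(
    "INSERT INTO testing.results VALUES (?, ?)",
    [("A", 0.92), ("B", 0.85)],
)
conn.executemany(
    "INSERT INTO shipping.orders VALUES (?, ?)",
    [("A", 4), ("A", 6), ("B", 12), ("B", 10)],
)

# One query joins test results with shipping data across both sources.
query = """
SELECT r.prototype, r.pass_rate, AVG(o.days_in_transit) AS avg_days
FROM testing.results AS r
JOIN shipping.orders AS o ON o.prototype = r.prototype
GROUP BY r.prototype, r.pass_rate
ORDER BY r.prototype
"""
rows = conn.execute(query).fetchall()

for prototype, pass_rate, avg_days in rows:
    print(f"{prototype}: pass rate {pass_rate:.0%}, average transit {avg_days:.1f} days")
```

In this made-up data, prototype A both tests better and ships faster, the kind of combined view that is hard to get when test and logistics data live in separate systems.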

Overall, the cloud-native, autonomous nature of the Oracle solution seems to make it an obvious choice for multicloud management. This is because Oracle is trying to differentiate itself by remaining keenly invested in an open, multicloud future. Over the next few years, it will be interesting to follow how other major cloud companies respond to the multicloud and how—or if—they choose to invest in increasingly open platforms focused on interoperability.