How Microsoft and Databricks are building a modern, cloud-native analytics platform

Microsoft and Databricks are two of the leading companies in the field of data and artificial intelligence (AI). They have been collaborating since 2017 to bring the best of both worlds to their customers: the power and flexibility of Apache Spark, the most popular open source framework for big data processing and machine learning, and the security and scalability of Azure, Microsoft’s cloud platform.

In October 2022, they announced a deeper partnership to evolve their analytics platform to an open and governed data lakehouse foundation. This means that customers can now unify their most demanding business intelligence, machine learning, and AI workloads on a single data foundation that combines the benefits of a data lake and a data warehouse. This foundation also enables customers to responsibly democratize their analytics data products to accelerate digital transformation applications across their organizations.

In this blog post, we will explore what this partnership means for customers, how it works, and the benefits of using Azure Databricks as part of the Microsoft Intelligent Data Platform.

What is Azure Databricks?

Azure Databricks is a cloud service that provides an optimized Apache Spark environment for data engineering, data science, and analytics. It is designed in collaboration with Databricks, whose founders started the Spark research project at UC Berkeley. Azure Databricks allows customers to set up their Spark clusters in minutes, autoscale them according to their needs, and collaborate on shared projects in an interactive workspace. Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries such as TensorFlow, PyTorch, and scikit-learn.

Azure Databricks also integrates seamlessly with other Azure services such as Azure Data Lake Storage, Azure Synapse Analytics, Azure Machine Learning, Azure Purview, and Azure DevOps. This enables customers to build end-to-end data pipelines and solutions that leverage the full potential of the cloud.
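To make this concrete, here is a minimal sketch of a Databricks notebook cell (PySpark) that reads raw CSV files from Azure Data Lake Storage Gen2 and persists them as a Delta table. The storage account, container, and table names are hypothetical placeholders.

```python
# Minimal sketch: read raw CSV files from Azure Data Lake Storage Gen2 and
# persist them as a Delta table from an Azure Databricks notebook.
# The storage account, container, and table names are hypothetical placeholders.

raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/sales/2023/"
delta_path = "abfss://curated@examplestorageacct.dfs.core.windows.net/delta/sales"

# `spark` is the SparkSession that Azure Databricks provides in every notebook.
sales_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Writing in Delta format adds ACID guarantees and schema enforcement
# on top of the files sitting in the data lake.
sales_df.write.format("delta").mode("overwrite").save(delta_path)

# Register the table so it can also be queried with SQL from the workspace.
spark.sql(f"CREATE TABLE IF NOT EXISTS sales USING DELTA LOCATION '{delta_path}'")
```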

What is an open and governed data lakehouse?

A data lakehouse is a new paradigm for data management that combines the best aspects of a data lake and a data warehouse. A data lake is a centralized repository that stores raw data, whether structured, semi-structured, or unstructured, in its native format. A data warehouse is a specialized system that organizes and optimizes data for analytical queries and reports.

A data lakehouse aims to provide the following advantages:

  • Open: A data lakehouse is built on open standards and formats such as Apache Parquet, Delta Lake, and Apache Spark. This ensures compatibility and interoperability with various tools and frameworks in the data ecosystem.
  • Governed: A data lakehouse provides a unified catalog and lineage for all the data assets in the lake. This enables data governance, security, quality, and compliance across the entire data lifecycle.
  • Performant: A data lakehouse leverages advanced techniques such as schema enforcement, indexing, caching, compaction, and partitioning to improve the performance and reliability of analytical queries on large-scale datasets (a short sketch of schema enforcement and partitioning follows this list).
  • Unified: A data lakehouse supports both batch and streaming workloads, as well as both structured and unstructured data. This enables customers to run all types of analytics workloads on a single platform without compromising on quality or efficiency.
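As a minimal sketch of the "performant" point above, the following PySpark cell creates a partitioned Delta table and shows a mismatched write being rejected by schema enforcement. The database, table, and column names are illustrative only.

```python
# Minimal sketch: schema enforcement and partitioning with Delta Lake.
# Database, table, and column names are illustrative only.

from pyspark.sql import Row

spark.sql("CREATE DATABASE IF NOT EXISTS lakehouse_demo")

events = spark.createDataFrame([
    Row(event_id=1, country="US", amount=19.99),
    Row(event_id=2, country="DE", amount=5.50),
])

# Partition by country so queries that filter on country read fewer files.
(
    events.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("country")
    .saveAsTable("lakehouse_demo.events")
)

# Schema enforcement: an append whose column types do not match the table
# schema is rejected instead of silently corrupting the data.
bad_batch = spark.createDataFrame([Row(event_id=3, country="FR", amount="not-a-number")])
try:
    bad_batch.write.format("delta").mode("append").saveAsTable("lakehouse_demo.events")
except Exception as err:
    print(f"Write rejected by Delta schema enforcement: {err}")
```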

How does the partnership work?

Microsoft and Databricks have partnered to build an open and governed data lakehouse foundation in the Microsoft Intelligent Data Platform by integrating their hallmark capabilities into a single solution for customers. The components of this solution include:

  • Azure Synapse Analytics: A cloud service that provides a unified experience for data warehousing, big data analytics, data integration, and AI. Azure Synapse Analytics allows customers to query both relational and non-relational data at petabyte scale using SQL or Spark. It also provides a serverless SQL pool that can directly query files in Azure Data Lake Storage without requiring any cluster or database provisioning.
  • Databricks SQL Analytics: A cloud service that provides a fast and easy way to run interactive SQL queries on data stored in Delta Lake. Delta Lake is an open source storage layer that brings reliability and performance to data lakes. It enables ACID transactions, schema enforcement, time travel, and upserts on Parquet files (a short upsert and time-travel sketch follows this list). Databricks SQL Analytics leverages Delta Engine, a vectorized query engine built on top of Apache Spark 3.0 that can run up to 8x faster than Spark SQL.
  • Azure Machine Learning: A cloud service that provides a comprehensive platform for building, training, deploying, and managing machine learning models. Azure Machine Learning allows customers to use automated machine learning to quickly identify suitable algorithms and hyperparameters. It also simplifies model management, monitoring, and updating across the cloud and the edge.
  • Azure Purview: A cloud service that provides a unified view of all the data assets across the enterprise. Azure Purview enables customers to discover, catalog, classify, and govern their data, as well as track its lineage and usage. Azure Purview also integrates with Databricks Unity Catalog, Databricks' unified governance layer, so that catalog metadata and lineage from Azure Databricks can be surfaced alongside the rest of the data estate in Azure Purview.
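To illustrate the Delta Lake capabilities mentioned above (ACID upserts and time travel), here is a minimal PySpark sketch. The table name and incoming rows are hypothetical, and the table is assumed to already exist.

```python
# Minimal sketch: an ACID upsert (MERGE) and a time-travel query on a Delta
# table. The table name and incoming rows are hypothetical, and the table
# `lakehouse_demo.customers` is assumed to already exist.

from delta.tables import DeltaTable
from pyspark.sql import Row

updates = spark.createDataFrame([
    Row(customer_id=42, email="new.address@example.com"),
    Row(customer_id=77, email="first.time@example.com"),
])

customers = DeltaTable.forName(spark, "lakehouse_demo.customers")

# Upsert: update matching rows and insert the rest in one atomic transaction.
(
    customers.alias("c")
    .merge(updates.alias("u"), "c.customer_id = u.customer_id")
    .whenMatchedUpdate(set={"email": "u.email"})
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: query the table as it existed at an earlier version.
spark.sql("SELECT * FROM lakehouse_demo.customers VERSION AS OF 0").show()
```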

What are the benefits of using Azure Databricks?

Azure Databricks provides several benefits for customers who want to build modern, cloud-native analytics solutions. Some of these benefits are:

  • Reliability: Azure Databricks provides a fully managed and optimized Apache Spark environment that ensures high availability, fault tolerance, and data consistency. Customers do not have to worry about setting up, configuring, or tuning their Spark clusters, as Azure Databricks takes care of all the operational aspects.
  • Scalability: Azure Databricks allows customers to scale their Spark clusters up and down according to their workload demands. Customers can also take advantage of autoscaling and auto-termination features that improve the total cost of ownership (TCO) of their Spark clusters (a cluster configuration sketch follows this list).
  • Productivity: Azure Databricks provides a collaborative workspace that enables customers to work on shared projects using notebooks, dashboards, and jobs. Customers can also use their preferred languages and frameworks, such as Python, Scala, R, SQL, TensorFlow, PyTorch, and scikit-learn. Azure Databricks also integrates with GitHub and Azure DevOps for version control and CI/CD.
  • Innovation: Azure Databricks provides access to the latest versions of Apache Spark and other open source libraries. Customers can also leverage the advanced capabilities of Azure Synapse Analytics, Azure Machine Learning, and Azure Purview to unlock insights from all their data and build AI solutions.
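As a sketch of the autoscaling and auto-termination settings mentioned above, the following call to the Databricks Clusters REST API (POST /api/2.0/clusters/create) provisions a cluster that scales between 2 and 8 workers and shuts down after 30 idle minutes. The workspace URL, token, runtime version, and node type are placeholders.

```python
# Minimal sketch: create an autoscaling, auto-terminating cluster through the
# Databricks Clusters REST API. The workspace URL, token, runtime version,
# and node type below are placeholders.

import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "13.3.x-scala2.12",   # example Databricks runtime version
    "node_type_id": "Standard_DS3_v2",     # example Azure VM size
    # Scale between 2 and 8 workers based on load.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Shut the cluster down after 30 idle minutes to control cost.
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```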

FAQs

Here are some frequently asked questions about Azure Databricks and the partnership with Microsoft.

Q: How can I get started with Azure Databricks?

A: You can get started with Azure Databricks by creating a pay-as-you-go account on Azure. You can also try Azure Databricks for free for 14 days by signing up [here].

Q: How much does Azure Databricks cost?

A: The pricing of Azure Databricks depends on the type and size of the Spark clusters you use, as well as the features you enable. You can find more details about the pricing [here].

Q: How can I migrate my existing Spark workloads to Azure Databricks?

A: You can migrate your existing Spark workloads to Azure Databricks by following the best practices and guidelines provided [here].

Q: How can I learn more about Azure Databricks and the partnership with Microsoft?

A: You can learn more about Azure Databricks and the partnership with Microsoft by visiting the following resources: