Custom Apache Spark Pools in Microsoft Fabric
Organizations rely on large-scale data processing and analytics to gain insights and make informed decisions. Apache Spark is a popular framework for this work, covering data transformation, machine learning, graph analysis, streaming, and more. However, configuring and managing the compute resources that Spark jobs need can be complex and time-consuming, particularly when workloads have diverse requirements.
To address these challenges, Microsoft Fabric, a cloud-based platform for data engineering and data science, provides a fully managed Spark compute service that lets you create custom Spark pools tailored to your analytics workloads. With custom Spark pools, you choose the node size and node count to match each workload's requirements, which helps performance and resource utilization. You can also use features such as autoscaling, dynamic executor allocation, and restorable availability to make your Spark jobs more efficient and reliable.
In this comprehensive blog post, we will explore custom Spark pools in Microsoft Fabric, guiding you through the process of creating and using them for your data analysis tasks. We will also answer some frequently asked questions about custom Spark pools and provide useful external links for further learning.
Understanding Custom Spark Pools
A custom Spark pool tells Spark what kind of resources you need for your data analysis tasks. You give the pool a name and specify the number and size of its nodes (the machines that execute tasks). You can also tell Spark how to adjust the number of nodes as the workload varies.
Creating a custom Spark pool does not by itself incur compute charges. You pay only when you execute a Spark job on the pool, and Spark takes care of provisioning the nodes for you. If your Spark pool remains unused for 2 minutes after your session expires, it is deallocated. The default session expiration period is 20 minutes, but you can adjust it.
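As a small worked example of the timing described above, the following Python sketch computes when an idle session expires and when the pool would then be released. The two durations are the defaults quoted in the text; the function name and the sample timestamp are illustrative only and are not part of any Fabric API.

```python
from datetime import datetime, timedelta

# Defaults quoted in the text; the session expiration period is adjustable.
SESSION_EXPIRY = timedelta(minutes=20)
POOL_IDLE_GRACE = timedelta(minutes=2)


def deallocation_time(last_activity,
                      session_expiry=SESSION_EXPIRY,
                      idle_grace=POOL_IDLE_GRACE):
    """Return (session_expires_at, pool_deallocated_at) for an idle session."""
    session_expires_at = last_activity + session_expiry
    return session_expires_at, session_expires_at + idle_grace


# Hypothetical timestamp of the last activity in a session.
last_activity = datetime(2024, 1, 15, 9, 0)
expires_at, released_at = deallocation_time(last_activity)
print(expires_at.time(), released_at.time())  # 09:20:00 09:22:00
```

With the defaults, a pool that goes idle at 09:00 would be released roughly 22 minutes later, assuming no other session is using it.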
For workspace administrators, the ability to create custom Spark pools for the entire workspace is a valuable feature. You can set a custom Spark pool as the default option for other users, saving time and eliminating the need to set up a new Spark pool each time you run a notebook or a Spark job.
It’s worth noting that custom Spark pools typically take about three minutes to start since Spark must obtain the necessary nodes from Azure.
Creating Custom Spark Pools
Let’s walk through the process of creating custom Spark pools in Microsoft Fabric:
- Access Workspace Settings: Start by going to your workspace and selecting “Workspace settings.”
- Choose Data Engineering/Science: Within the settings menu, expand the “Data Engineering/Science” option.
- Navigate to Spark Compute: From the left-hand menu, click on “Spark Compute.”
- Create a New Pool: Select “New Pool” to initiate the creation of a custom Spark pool.
- Pool Configuration: In the “Create Pool” menu, provide a name for your Spark pool. You can also select the node family and node size from the available options (e.g., Small, Medium, Large, X-Large, and XX-Large) based on your specific compute requirements.
- Set Minimum Node Configuration: You can set the minimum node count for a custom pool as low as 1. Because Fabric Spark provides restorable availability for single-node clusters, you don't need to worry about job failures, session loss during failures, or overpaying for compute on smaller Spark jobs.
- Enable Autoscaling: Optionally, you can enable or disable autoscaling for your custom Spark pools. When autoscaling is active, the pool dynamically acquires new nodes up to the maximum node limit specified by the user and retires them after job execution. This ensures better performance by adjusting resources based on job requirements.
- Dynamic Executor Allocation: You can also choose to enable dynamic executor allocation for your Spark pool. This feature automatically determines the optimal number of executors within the user-specified maximum bound. It adjusts the number of executors based on data volume, resulting in improved performance and resource utilization.
- Autopause Duration: These custom pools come with a default autopause duration of 2 minutes.
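The options in the steps above can be pictured as a small settings object. The sketch below is a hypothetical Python model of those choices, not Fabric's actual API: the class name, field names, and validation rules are assumptions made for illustration. It checks a configuration for obvious mistakes and shows how autoscaling keeps the node count between the minimum and the user-specified maximum.

```python
from dataclasses import dataclass

# Node sizes as listed in the Create Pool menu described above.
NODE_SIZES = ("Small", "Medium", "Large", "X-Large", "XX-Large")


@dataclass(frozen=True)
class PoolSettings:
    """Hypothetical container for the options chosen when creating a pool."""
    name: str
    node_size: str = "Medium"
    min_nodes: int = 1
    max_nodes: int = 1
    autoscale: bool = False
    dynamic_executors: bool = False

    def problems(self):
        """Return a list of human-readable configuration problems."""
        found = []
        if not self.name.strip():
            found.append("pool name is empty")
        if self.node_size not in NODE_SIZES:
            found.append(f"unknown node size: {self.node_size}")
        if self.min_nodes < 1:
            found.append("minimum node count must be at least 1")
        if self.max_nodes < self.min_nodes:
            found.append("maximum node count is below the minimum")
        return found

    def autoscaled_nodes(self, requested):
        """With autoscaling on, clamp a requested node count to [min, max]."""
        return max(self.min_nodes, min(requested, self.max_nodes))


# Illustrative values only.
pool = PoolSettings(name="etl-pool", node_size="Large",
                    min_nodes=1, max_nodes=8, autoscale=True)
print(pool.problems())            # []
print(pool.autoscaled_nodes(12))  # 8
print(pool.autoscaled_nodes(0))   # 1
```

A job that could use 12 nodes is held at the pool's maximum of 8, while an idle pool never drops below its minimum of 1.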
The node sizes and node counts you can configure are limited by the capacity units purchased as part of your Fabric capacity SKU.
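To make the relationship between capacity units and node sizing concrete, here is a rough Python sketch. It assumes that one capacity unit corresponds to two Spark vCores and that the node sizes provide 4, 8, 16, 32, and 64 vCores respectively. These figures are recalled from Microsoft's documentation rather than stated in this post, so verify them before relying on the result. The sketch also ignores any bursting allowance.

```python
# Assumed vCores per node size; verify against Microsoft's documentation.
VCORES_PER_NODE = {
    "Small": 4,
    "Medium": 8,
    "Large": 16,
    "X-Large": 32,
    "XX-Large": 64,
}

# Assumed conversion between capacity units and Spark vCores.
SPARK_VCORES_PER_CU = 2


def max_nodes_for_capacity(capacity_units, node_size):
    """Return how many whole nodes of a given size fit in a capacity."""
    total_vcores = capacity_units * SPARK_VCORES_PER_CU
    return total_vcores // VCORES_PER_NODE[node_size]


# A capacity with 64 capacity units gives 128 Spark vCores under
# the assumptions above.
print(max_nodes_for_capacity(64, "Medium"))    # 16
print(max_nodes_for_capacity(64, "XX-Large"))  # 2
```

Under these assumptions, the same capacity supports many small nodes or a few large ones, which is the trade-off to weigh when picking a node size and a maximum node count.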
Using Custom Spark Pools
Once you have created your custom Spark pool, you can seamlessly utilize it for running notebooks or Spark jobs in Microsoft Fabric. Here’s how to do it:
Running a Notebook on a Custom Spark Pool:
- Navigate to your workspace and select “Notebooks” from the left-hand menu.
- Create a new notebook or open an existing one.
- In the notebook toolbar, select “Attach To” from the drop-down menu.
- Choose your custom Spark pool from the list of available options.
- Run your notebook cells as usual.
Running a Spark Job on a Custom Spark Pool:
- Access your workspace and select “Jobs” from the left-hand menu.
- Create a new job or open an existing one.
- In the job definition page, select “Edit” from the top-right corner.
- In the “Job Settings” section, pick your custom Spark pool from the Spark Pool drop-down menu.
- Save your job definition and run it as you normally would.
Frequently Asked Questions
Let’s address some common questions related to custom Spark pools:
Q: What is the difference between custom Spark pools and starter pools?
A: Starter pools offer a fast and convenient way to use Spark on the Microsoft Fabric platform within seconds. With starter pools, you can use Spark sessions immediately, without waiting for Spark to set up nodes, allowing you to work with data and gain insights more quickly. Starter pools have Spark clusters that are always on and ready for your requests. They utilize medium nodes that dynamically scale up based on your Spark job requirements. Starter pools also come with default settings that expedite library installation without slowing down session start times.
In contrast, custom Spark pools are tailored compute environments created based on specific requirements. Users can choose the node family, node size, minimum and maximum nodes, autoscaling, and dynamic executor allocation options for custom Spark pools. Custom Spark pools take approximately three minutes to start as Spark needs to obtain nodes from Azure.
Q: How much does it cost to use custom Spark pools?
A: The cost of using custom Spark pools is based on the compute resources consumed while Spark jobs execute. Pricing depends on factors such as node family, node size, and the duration of your Spark job. Detailed information is available in Microsoft's pricing and billing documentation for Fabric.
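As a back-of-the-envelope illustration of how node count, node size, and job duration combine, the Python sketch below computes vCore-hours for a job and multiplies them by a rate. The rate is a made-up placeholder and the function is hypothetical. Fabric's actual billing units and prices are defined in Microsoft's pricing documentation, so treat this only as a way to reason about relative cost.

```python
def estimate_job_cost(nodes, vcores_per_node, runtime_minutes,
                      rate_per_vcore_hour):
    """Return (vcore_hours, estimated_cost) for a single Spark job.

    rate_per_vcore_hour is a placeholder, not a published price.
    """
    vcore_hours = nodes * vcores_per_node * runtime_minutes / 60
    return vcore_hours, vcore_hours * rate_per_vcore_hour


# Illustrative values: 3 nodes with 8 vCores each, running for 30 minutes.
vcore_hours, cost = estimate_job_cost(
    nodes=3, vcores_per_node=8, runtime_minutes=30,
    rate_per_vcore_hour=0.25,  # hypothetical rate
)
print(vcore_hours, cost)  # 12.0 3.0
```

Doubling either the node count or the runtime doubles the estimate, which is why autoscaling limits and autopause settings matter for cost.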
Q: How can I monitor and troubleshoot my custom Spark pools?
A: Microsoft Fabric provides two essential tools for monitoring and troubleshooting custom Spark pools:
- Fabric Monitor: This tool allows you to view the status, metrics, logs, and events associated with your custom Spark pools.
- Fabric Diagnostics: Use this tool to diagnose and resolve common issues that may arise with your custom Spark pools.
Instructions for using these tools are available in the Microsoft Fabric documentation.
Further Learning
To learn more about custom Spark pools and Microsoft Fabric, see the following resources:
- Create custom Apache Spark pools in Fabric – Microsoft Fabric
- Spark compute for Data Engineering and Data Science – Microsoft Fabric
Custom Spark pools in Microsoft Fabric provide a flexible and efficient solution for managing Spark resources and enhancing your data engineering and data science projects. Whether you’re a data professional, data scientist, or a business looking to extract insights from your data, custom Spark pools can streamline your workflows and help you make the most of your data analytics capabilities.