How to Configure a Lakehouse in Copy Activity

A lakehouse merges data lakes and data warehouses into a unified space for data management. With support for open formats and standards, batch and streaming data ingestion, schema evolution, transactional consistency, and robust SQL queries, the lakehouse gives organizations advanced analytics capabilities in one place. Microsoft Fabric, a cloud-based platform for data engineering and data science, includes the Microsoft Fabric Lakehouse, a data analytics engine that simplifies and accelerates data analytics at scale.

In this blog post, we will walk you through configuring the lakehouse within Copy Activity, a core component of Microsoft Fabric Data Factory. Copy Activity lets you copy data between sources and destinations, including the lakehouse itself. Whether you need to move data from various sources into your lakehouse, from the lakehouse out to other destinations, or to handle tasks such as incremental loading, upsert loading, schema drift, compression, encryption, and partitioning along the way, Copy Activity has you covered.

Prerequisites

Before configuring the lakehouse in Copy Activity, ensure you have the following prerequisites in place:

  1. Microsoft Fabric account: Sign up for a Microsoft Fabric account if you don’t have one (a free trial is available).
  2. Microsoft Fabric workspace: Create a Microsoft Fabric workspace to serve as your data hub.
  3. OneLake: OneLake, the unified data lake built into Fabric, offers secure, scalable, and cost-effective data storage; it is provisioned automatically with your Fabric tenant, so there is no separate account to create.
  4. Microsoft Fabric Lakehouse: Create a Lakehouse to serve as your data analytics engine.
  5. Microsoft Fabric Data Factory: Set up Microsoft Fabric Data Factory to create and manage your data pipelines.

Create a Copy Pipeline

A copy pipeline serves as a logical container for one or more copy activities, which work together to accomplish specific data-related tasks. To create a copy pipeline, follow these steps; a sketch of the resulting pipeline definition appears after the list:

  1. In your Microsoft Fabric workspace, navigate to the Data Factory tab and select your Data Factory.
  2. Within the Data Factory page, access the Author tab and click on Pipelines.
  3. Select “New pipeline” and provide a name for the pipeline, such as “CopyPipeline.”
  4. In the Activities panel, expand “Move & transform” and drag the “Copy Data” activity to the canvas.
  5. Select the copy activity and assign it a name, e.g., “CopyActivity.”
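
Under the hood, a Fabric pipeline is stored as a JSON definition. The sketch below, written as a runnable Python snippet for readability, shows roughly what the steps above produce. The property names are assumptions patterned on the Azure Data Factory copy-activity schema, which Fabric Data Factory pipelines closely mirror, so verify them against your pipeline’s JSON view.

```python
import json

# Minimal sketch of the pipeline the steps above create. Key names are
# assumptions based on the ADF-style schema; check the pipeline's JSON
# view in Fabric for the exact spelling.
pipeline = {
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyActivity",
                "type": "Copy",
                "typeProperties": {
                    "source": {},  # filled in under "Configure the Source"
                    "sink": {},    # filled in under "Configure the Destination"
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```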

Configure the Source

The source represents the data store from which you want to copy data. To configure the source, follow these steps:

  1. In the copy activity settings, navigate to the Source tab.
  2. Click on “New” in the Source dataset field to create a new dataset.
  3. On the New dataset page, select “Workspace” as the Data store type.
  4. Choose “Lakehouse” as the Workspace data store type.
  5. Click “Continue.”
  6. Provide a name for the dataset, such as “SourceDataset.”
  7. In the Connection tab, select your Lakehouse from the drop-down list.
  8. Choose “Tables” or “Files” as the Root folder, depending on whether you intend to copy from a table or a file in your Lakehouse.

If you select Tables:

  • Choose an existing table from the “Table name” drop-down list, or enter the table name manually.
  • Optionally, specify a Timestamp or a Version to query an older snapshot of the table (Delta time travel).
  • Optionally, add Additional columns to include the relative path or a static value of the source files in the output. A sketch of these source settings follows.
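
For reference, here is one plausible shape for the table-source options above, expressed as a runnable Python snippet. The type name LakehouseTableSource and the timestampAsOf, versionAsOf, and additionalColumns keys are assumptions modeled on the ADF-style connector schema; the values are purely illustrative.

```python
import json

# Sketch of a Lakehouse table source with optional Delta time travel.
# Use either "timestampAsOf" or "versionAsOf", not both; the key names
# are assumptions, and the snapshot and columns are illustrative.
source = {
    "type": "LakehouseTableSource",  # assumed type name
    "timestampAsOf": "2023-10-01T00:00:00Z",  # optional: snapshot by time
    # "versionAsOf": 3,                       # optional: snapshot by version
    "additionalColumns": [
        # append the source file's relative path, or a static value,
        # as extra columns in the copied output
        {"name": "source_path", "value": "$$FILEPATH"},
        {"name": "load_tag", "value": "nightly"},
    ],
}

print(json.dumps(source, indent=2))
```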

If you select Files:

  • Select “File path,” “Wildcard file path,” or “List of files” as the File path type, depending on how you want to specify your source files; a sketch of each style follows these options.

If you select File path:

  • Click on “Browse” and select a file from your Lakehouse unmanaged area (under Files), or enter a file path manually.

If you select Wildcard file path:

  • Enter a folder or file path with wildcard characters (*) under your Lakehouse unmanaged area (under Files) to filter your source files.
  • Optionally, enter a Wildcard file name to further filter your source files by name.

If you select List of files:

  • Enter one or more file paths under your Lakehouse unmanaged area (under Files) to specify your source files.

  9. Click on “OK.”
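
The three file-path styles translate into different source properties. The sketch below shows one plausible shape for each; the key names (wildcardFolderPath, wildcardFileName, fileListPath) are assumptions taken from the ADF file-store schema, and every path is illustrative.

```python
import json

# One plausible JSON shape per file-path style; key names are assumed
# from the ADF schema and all paths are made up for illustration.
file_path = {
    "folderPath": "Files/raw/sales",
    "fileName": "orders.csv",
}

wildcard_file_path = {
    "wildcardFolderPath": "Files/raw/*",  # wildcard over folders
    "wildcardFileName": "orders_*.csv",   # optional name filter
}

list_of_files = {
    # points at a text file that lists one source file path per line
    "fileListPath": "Files/config/files_to_copy.txt",
}

for style in (file_path, wildcard_file_path, list_of_files):
    print(json.dumps(style, indent=2))
```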

Configure the Destination

The destination represents the data store to which you want to copy data. To configure the destination, follow these steps:

  1. In the copy activity settings, go to the Sink tab.
  2. Click on “New” in the Sink dataset field to create a new dataset.
  3. On the New dataset page, select “Workspace” as the Data store type.
  4. Choose “Lakehouse” as the Workspace data store type.
  5. Click “Continue.”
  6. Provide a name for the dataset, such as “SinkDataset.”
  7. In the Connection tab, select your Lakehouse from the drop-down list.
  8. Choose “Tables” or “Files” as the Root folder, depending on whether you intend to copy to a table or a file in your Lakehouse.

If you select Tables:

  • Choose an existing table from the “Table name” drop-down list, or enter the table name manually.
  • Optionally, you can specify a Pre-copy script to run before copying data to the table.
  • Optionally, you can specify a Post-copy script to run after copying data to the table.

If you select Files:

  • Enter a folder or file path under your Lakehouse unmanaged area (under Files) to specify your destination files.
  • Optionally, specify a File name to override the destination file name.

  9. Click on “OK.” A sketch of both sink shapes follows.
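
As with the source, the sink options map to a JSON fragment. The sketch below shows one plausible shape for a table sink and for a file sink; the type names, keys, script, and paths are all assumptions or illustrations rather than confirmed Fabric schema.

```python
import json

# Hypothetical table sink: the type name, "tableOption", and the
# pre-copy script key are assumptions patterned on the ADF sink schema.
table_sink = {
    "type": "LakehouseTableSink",
    "tableOption": "autoCreate",  # create the table if it does not exist
    "preCopyScript": "DELETE FROM sales WHERE load_date = '2023-10-01'",
}

# Hypothetical file sink: the folder path and overriding file name
# mirror the UI options described above.
file_sink = {
    "type": "LakehouseWriteSettings",
    "folderPath": "Files/curated/sales",
    "fileName": "orders_copy.csv",  # optional: overrides the destination name
}

print(json.dumps(table_sink, indent=2))
print(json.dumps(file_sink, indent=2))
```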

Configure the Mapping

The mapping defines how the data is copied from the source to the destination. To configure the mapping, follow these steps:

  1. In the copy activity settings, navigate to the Mapping tab.
  2. Click on “Import schemas” to automatically generate the mapping based on the source and sink schemas.
  3. Optionally, you can edit the mapping by adding, removing, or modifying the columns and their properties.
  4. Optionally, enable Schema drift handling to handle changes in the source schema at runtime. A sketch of an explicit mapping follows these steps.
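
An imported or hand-edited mapping is stored as a translator object. The sketch below uses the TabularTranslator shape from the ADF copy-activity schema, which Fabric pipelines mirror; the column names are illustrative, and the drift comment reflects an assumption about how unmapped columns are treated.

```python
import json

# Explicit column mapping in the ADF-style "TabularTranslator" shape.
# "Import schemas" in the UI generates an equivalent structure for you.
translator = {
    "type": "TabularTranslator",
    "mappings": [
        {"source": {"name": "OrderId"},   "sink": {"name": "order_id"}},
        {"source": {"name": "OrderDate"}, "sink": {"name": "order_date"}},
        {"source": {"name": "Amount"},    "sink": {"name": "amount"}},
    ],
    # Assumption: with schema drift handling enabled, columns that appear
    # in the source at runtime but are not mapped here can still be
    # carried through instead of failing the copy.
    "typeConversion": True,
}

print(json.dumps(translator, indent=2))
```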

Configure the Settings

The settings provide additional options for controlling the data copying process from the source to the destination. To configure the settings, follow these steps:

  1. In the copy activity settings, go to the Settings tab.
  2. Optionally, specify a Stage location to stage your data in intermediate storage before it is copied to the destination. This can improve performance and reliability in some scenarios.
  3. Optionally, you can enable Skip incompatible rows to skip any rows that are incompatible with the destination schema and log them in a separate file.
  4. Optionally, you can enable Fault tolerance to handle any errors or failures during the copy operation and log them in a separate file.
  5. Optionally, you can enable Compression or Decompression to compress or decompress your data during the copy operation.
  6. Optionally, you can enable Encryption or Decryption to encrypt or decrypt your data during the copy operation.
  7. Optionally, you can enable Partitioning to partition your data by date or size during the copy operation.
  8. Optionally, enable Incremental loading or Upsert loading to load only new or updated data from the source to the destination. A sketch of these settings follows the list.
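
Several of these toggles correspond to copy-activity properties in the underlying JSON. The sketch below names a few using the ADF schema (enableStaging, skipErrorFile, logSettings); treat the exact keys as assumptions when comparing against the Fabric UI, and the staging path as illustrative.

```python
import json

# A handful of the optional settings above, expressed with ADF-style
# property names; exact keys in Fabric may differ, so verify them.
settings = {
    "enableStaging": True,
    "stagingSettings": {
        "path": "Files/staging",  # illustrative intermediate stage location
    },
    "skipErrorFile": {
        "dataInconsistency": True,  # fault tolerance: skip inconsistent files
    },
    "logSettings": {
        "enableCopyActivityLog": True,  # log skipped rows/files for review
        "copyActivityLogSettings": {"logLevel": "Warning"},
    },
}

print(json.dumps(settings, indent=2))
```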

Run and Monitor the Copy Pipeline

To run and monitor the copy pipeline, follow these steps:

  1. In your Microsoft Fabric workspace, go to the Data Factory tab and select your Data Factory.
  2. Within the Data Factory page, access the Author tab and select Pipelines.
  3. Select your copy pipeline and click on “Debug” to run it in debug mode.
  4. Wait for the pipeline run to complete and check for any errors or warnings in the Output panel.
  5. To monitor your pipeline runs in more detail, go to the Monitor tab and select Pipeline runs. If you prefer to trigger runs outside the UI, see the REST sketch below.
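
If you want to trigger a run programmatically, Fabric exposes a REST endpoint for on-demand item jobs. The sketch below is a hedged example: the URL shape and jobType=Pipeline value are assumptions based on the public Fabric job-scheduler API, and the workspace ID, pipeline item ID, and bearer token are placeholders you must supply yourself.

```python
import requests

# Placeholders: substitute your own IDs and an Azure AD bearer token.
WORKSPACE_ID = "<workspace-guid>"
PIPELINE_ID = "<pipeline-item-guid>"
TOKEN = "<aad-bearer-token>"

# Assumed endpoint shape for running a pipeline item on demand.
url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances?jobType=Pipeline"
)

resp = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# The run is asynchronous: the Location header (if present) points at
# the job instance, which you can poll for succeeded/failed status.
print("Poll for status at:", resp.headers.get("Location"))
```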

In this blog post, we’ve demonstrated how to configure the lakehouse in a copy activity using Microsoft Fabric Data Factory. This technique lets you move data from various sources into the lakehouse, from the lakehouse out to other destinations, or within the lakehouse itself.

FAQs:

Q: How can I troubleshoot errors during data copying using Copy Activity?
A: You can utilize the built-in monitoring and diagnostic tools in Microsoft Fabric Data Factory to troubleshoot errors during data copying. Additionally, you can refer to the official documentation for detailed guidance.

Q: What are the benefits of using a lakehouse architecture?
A: A lakehouse architecture offers a unified platform that combines the advantages of data lakes and data warehouses. It lets you manage data in one place, supports various data formats, and facilitates advanced analytics while providing transactional consistency.

Q: Can I automate copy pipelines in Microsoft Fabric Data Factory?
A: Yes, you can automate copy pipelines using scheduling, triggers, and event-driven workflows in Microsoft Fabric Data Factory. This enables you to perform data copying tasks at specified intervals or in response to events.

Q: Are there any best practices for optimizing data copying performance in Copy Activity?
A: To optimize data copying performance, consider factors such as data partitioning, parallelization, and appropriate data movement techniques. Microsoft Fabric provides performance tuning recommendations in its documentation; a sketch of two common knobs follows.
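
As a concrete illustration of two such knobs, the sketch below sets copy parallelism using the ADF property names parallelCopies and dataIntegrationUnits; whether and how these surface in the Fabric UI is an assumption to verify, and the values are starting points rather than tuned recommendations.

```python
import json

# Two ADF-style performance knobs for a copy activity; values are
# illustrative starting points, not recommendations.
perf = {
    "parallelCopies": 8,         # concurrent readers/writers across partitions
    "dataIntegrationUnits": 16,  # compute allotted to the copy (ADF term)
}

print(json.dumps(perf, indent=2))
```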