
Building scalable Github actions for Data Platform pipelines

Caroline Cah

In this post, I'll write about my experience setting up efficient and scalable GitHub Actions workflows for deploying data pipelines on your data platform. The goal is to enable teams to easily add data to the platform using a self-service approach, while maintaining control over deployment processes and environments.


Why is this important?


This is important for ensuring scalability, allowing teams to manage and deploy pipelines independently without disruption. It fosters team autonomy and collaboration, enabling parallel work while maintaining consistency through shared infrastructure. By separating environments (dev/prod) and automating checks, it reduces the risk of untested code reaching production. Efficient CI/CD workflows speed up deployment and validation, while centralized controls and access restrictions strengthen security and compliance, protecting critical environments and enforcing best practices. These are the key building blocks of a growing data platform.


What is the problem we're trying to solve here?


Our team needs to create a Proof of Concept (POC) for deploying data pipelines to the data platform. The desired future use case looks something like this:


Use Case: Team "Factory systems" wants to add data to the data platform. They push their Terraform code through our GitHub Actions flow, which checks the code. If no errors are found, the data pipeline is created and the data is stored in the data platform.


In order to achieve this we need to consider the following questions:


Should we create one repo per data pipeline or one big repo for the entire data platform?


Should we have separate repos for dev and prod environments or manage them with a single repo?


Repository structure: Monorepo vs. Polyrepo


One of the first decisions we need to make is whether to use a monorepo (a single repository for the entire data platform) or a polyrepo (one repository per data pipeline). Here is what to consider.


Monorepo: one big repo for the data platform


With a monorepo, all infrastructure code and data pipelines live in one place. The pros of this are centralized control: it is easy to enforce RFCs and policies across the entire platform, and easy to manage common resources and components shared by multiple teams. Lastly, the full scope of the platform is visible to everyone.


However, as the project scales and the number of pipelines grows, the repo can become bloated and slow. CI/CD gets slower, and a change by a single team member will most likely trigger the full pipeline, which makes development cumbersome. RBAC is also challenged, since it becomes difficult to restrict access to specific teams or components.
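

If you do go the monorepo route, path filters in GitHub Actions can soften the slow-CI problem by only triggering a pipeline's workflow when files in its own directory change. A minimal sketch, assuming pipelines live under a pipelines/<name>/ directory (the layout and test path are hypothetical):

name: Orders pipeline CI

on:
  push:
    paths:
      - 'pipelines/orders/**'  # only changes inside this pipeline's folder trigger the workflow

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Run this pipeline's tests only
        run: pytest pipelines/orders/tests  # hypothetical path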


When to use monorepos


When teams are small, work closely together, and pipelines are relatively simple. If there are shared domain libraries that rely heavily on each other, a monorepo might make sense.


Polyrepo: one repo per data pipeline


With a polyrepo, each data pipeline has its own repository, while shared infrastructure (such as Terraform modules) lives in a separate repo.


This is a highly scalable approach since each pipeline is independent, so adding new ones doesn't increase repo complexity. Since the pipelines are separate, CI/CD also runs in isolation, which makes deployments faster. Watching tests tick green might be satisfying, but who wants to wait for all 375 of them? Team autonomy is another huge benefit of this solution: teams can manage their own pipeline repos without affecting others. RBAC is also easy to apply, since repo access is easily managed in GitHub.


The downsides then? Duplication: it's harder to enforce consistent standards across multiple repos. Complexity: managing multiple repos can add overhead, especially when dealing with shared components.
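

One way to soften the duplication problem is GitHub's reusable workflows: the shared CI/CD logic lives once in a central repo and each pipeline repo just calls it. A minimal sketch of such a caller, assuming a central repo your-org/platform-workflows exposes a terraform-deploy.yml workflow with an environment input (all of those names are assumptions):

name: Deploy data pipeline

on:
  push:
    branches:
      - dev

jobs:
  deploy:
    # Call the shared workflow instead of copy-pasting the Terraform steps into every repo
    uses: your-org/platform-workflows/.github/workflows/terraform-deploy.yml@main
    with:
      environment: dev
    secrets: inherit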


When to use polyrepos


When you have multiple streams/teams, a large number of pipelines, or complex requirements for individual pipelines. It provides better scalability and flexibility long term.


My recommendation


The best of both worlds is to use a hybrid approach.


Monorepo for platform infrastructure, where you manage the core infra (Terraform modules, data platform provisioning).


Polyrepo for individual data pipelines, where each pipeline has its own repo, allowing teams to manage their pipelines independently.


Managing dev, staging and prod. Separate repos or one?


The next question is whether to manage development and production environments in separate repos or within the same repo.


Separate repos for dev, staging and prod


When managing separate repositories for dev, staging, and prod environments, each environment gets its own dedicated repo. This approach allows teams to keep the code and infrastructure configurations for different environments isolated from one another, reducing the risk of accidental deployments and improving control over each environment.


The main benefit of separate repos is environment isolation: dev, staging and prod can be managed independently. This isolation reduces the risk of introducing breaking changes from one environment into another. For instance, developers working in the dev environment can push experimental features without worrying about affecting staging or production.


GitHub's access control mechanisms can also be more granular with this choice. Since each environment has its own repo, it is easy to manage access permissions separately for each environment. This means that only specific team members, such as the DevOps team, can have write access to production, while other developers may only have access to dev or staging.


There are, however, several challenges with separate repos. There is increased overhead for engineers in managing multiple repos. Creating and maintaining separate CI/CD pipelines adds administrative effort, especially when ensuring consistency across environments. When working with separate repositories, more coordination is needed between teams to ensure that changes made in dev are properly promoted to staging and then to production. Without careful planning, this can lead to inconsistencies across environments if different versions of code or infrastructure are deployed. And then there is the obvious DRY (Don't Repeat Yourself) problem: duplicating code across repos just doesn't make sense.


Final Recommendation


If your team is looking for scalability and the ability to work autonomously while ensuring infrastructure consistency, the hybrid approach is still the best bet. It strikes a balance by maintaining shared infrastructure in a monorepo while giving teams control over their specific pipelines in polyrepos. For larger projects or teams that need strict control over production environments, consider using a separate repository for production while combining dev and staging to reduce overhead.


This hybrid model scales well, particularly when paired with strong CI/CD practices like environment-specific workflows, branch protection rules, and role-based access control (RBAC). However, always tailor the approach based on your specific team size, workflow complexity, and governance requirements.
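

To make "environment-specific workflows" concrete: GitHub Environments let you attach protection rules, such as required reviewers, to a named environment, so a deploy job pauses for approval before it touches production even within a single repo. A minimal sketch, assuming an environment called prod has been created under the repo's Settings → Environments:

name: Deploy to production

on:
  push:
    branches:
      - prod

jobs:
  deploy-prod:
    runs-on: ubuntu-latest
    # The job waits here until the "prod" environment's required reviewers approve
    environment: prod
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v1

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve -var-file="prod.tfvars"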


Use monorepos for shared infrastructure (Terraform) to centralize infrastructure management and ensure consistency. Use polyrepos for individual pipelines, allowing teams to work autonomously while maintaining scalability.


Show me the code!!


For a monorepo holding shared infrastructure (Terraform modules in this example), you can set up environment-specific workflows that apply configs based on your branches or directories. Here's an example 🙂


GitHub Actions Workflow (.github/workflows/infrastructure.yml):


name: Easy peasy lemon squeezy deploy infrastructure

on:
  push:
    branches:
      - dev
      - staging
      - prod

jobs:
  plan:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v1

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: terraform plan -var-file="${{ github.ref_name }}.tfvars"

  apply:
    needs: plan
    runs-on: ubuntu-latest
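    # Apply only runs for the prod branch; dev and staging stop after the plan job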
    if: github.ref == 'refs/heads/prod'

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v1

      - name: Terraform Apply
        run: terraform apply -auto-approve -var-file="prod.tfvars"

In this little snippet the workflow runs on the dev, staging and prod branches (duh). Each environment has its own .tfvars file, and the Terraform apply command only runs on the prod branch, ensuring that changes are only applied to production when code is merged into prod.


Code example for a polyrepo with individual data pipelines


Each data pipeline can have its own repository and CI/CD pipeline for deploying independently. Here's a code example for a data pipeline in a polyrepo that triggers based on the branch being pushed. This one is in Python, mainly because Python is commonly used for data pipelines.


GitHub Actions Workflow (.github/workflows/deploy-pipeline.yml):


name: Push it to the limit data pipeline

on:
  push:
    branches:
      - dev
      - staging
      - main  # Main is considered prod here

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run unit tests
        run: pytest

  deploy:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v1

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve -var-file="${{ github.ref_name }}.tfvars"

The test job ensures that unit tests are run before deploying. The deployment is then triggered based on the branch (dev, staging or main).


Using Databricks on Azure


Databricks + Azure is a common platform choice when creating a data platform. Terraform is used to provision and manage infra using config files. HashiCorp offers the azurerm Terraform provider, which allows you to manage Azure resources, and the Databricks Terraform provider adds support for managing Databricks resources directly, for example clusters, jobs and notebooks.


You can integrate Terraform with GitHub Actions or Azure Pipelines to automate your Databricks deployments. For example, GitHub Actions can trigger the Terraform workflow to provision resources in Azure and Databricks based on new code changes, allowing for continuous integration and deployment of infrastructure.
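

As a rough sketch of what that integration can look like, the workflow below feeds Azure service principal credentials to Terraform through the standard ARM_* environment variables that the azurerm provider (and the Azure auth path of the Databricks provider) reads, and then runs plan and apply. The secret names are assumptions; use whatever your repository defines:

name: Provision Databricks on Azure

on:
  push:
    branches:
      - main

# Standard ARM_* variables picked up by the azurerm provider.
# The secret names below are hypothetical.
env:
  ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
  ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
  ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v1

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: terraform plan

      - name: Terraform Apply
        run: terraform apply -auto-approve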
