Databricks data apps with GitHub Actions

Kaarel Kõrvemaa | Oct 27, 2024

Data apps and Deployment with GitHub Actions


Databricks data apps are now available in public preview. This example can be used for deploying any app; the process itself does not depend on which app is in question. Databricks supports the following frameworks: Dash, Flask, Gradio, Shiny and Streamlit.

This blog is about how to deploy apps, not so much about how to develop them. The example is Azure-specific and may differ for AWS and GCP.

Prerequisites

Before you begin, ensure you have the following:

  • A GitHub repository
  • A Databricks workspace
  • A service principal for the CI/CD
  • A SQL endpoint (warehouse)

Create a service principal for the deployment. Depending on your Databricks setup, you can create either an Entra ID or a Databricks-managed (OBO) service principal.

Detailed instructions can be found here
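For the Databricks-managed variant, a minimal sketch with the Databricks CLI could look like the following. The display name cicd-deployer is only an example, the exact command may vary between CLI versions, and the OAuth client secret is generated separately (for example in the account console):

```shell
# Sketch: create a Databricks-managed service principal for CI/CD.
# Requires admin rights; the display name is a hypothetical example.
databricks service-principals create --display-name "cicd-deployer"
```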

GitHub CI/CD

If you want to jump straight to the code, it can be found here

1. Install the Databricks CLI

This step installs the Databricks CLI, the command-line tool for interacting with the Databricks platform. The CLI is needed in the later steps for syncing files and deploying apps.

  curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
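Inside the workflow, the install can be wrapped in its own step; a minimal sketch (Databricks also publishes a setup-cli GitHub action that does the same):

```yaml
    - name: Install the Databricks CLI
      run: |
        curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```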

2. Get Databricks Access Token using Service Principal

This step requests an OAuth access token for the service principal using the client-credentials flow against the workspace token endpoint. You can also limit the lifetime of the token and which APIs it may access; here we request access to all APIs with the all-apis scope.

    - name: Get Databricks Access Token using Service Principal
      run: |
        export CLIENT_ID=${{ secrets.DATABRICKS_CLIENT_ID_CICD }}
        export CLIENT_SECRET=${{ secrets.DATABRICKS_CLIENT_SECRET_CICD }}
        export TOKEN_ENDPOINT_URL="${{ secrets.DATABRICKS_HOST }}/oidc/v1/token"
        
        DATABRICKS_TOKEN=$(curl --request POST \
          --url "$TOKEN_ENDPOINT_URL" \
          --user "$CLIENT_ID:$CLIENT_SECRET" \
          --data 'grant_type=client_credentials&scope=all-apis' | jq -r .access_token)
        
        echo "DATABRICKS_TOKEN=$DATABRICKS_TOKEN" >> $GITHUB_ENV        

3. Create Databricks Directory for Apps

This step is not necessary if you want to deploy directly from the folders of users that are connected with Databricks.

This step creates the /apps directory in the Databricks workspace using the workspace mkdirs API, so that the applications can be synced there in the next step.

There can be several apps; the limit is 10 apps per workspace, and they can use any of the supported frameworks mentioned in the beginning. The deployment pipeline doesn't check the app framework.

    - name: Create Databricks Directory for apps
      run: |
        RESPONSE=$(curl --request POST \
          --url "${{ secrets.DATABRICKS_HOST }}/api/2.0/workspace/mkdirs" \
          --header "Authorization: Bearer $DATABRICKS_TOKEN" \
          --header "Content-Type: application/json" \
          --data '{"path": "/apps"}')

        echo "response=$RESPONSE" >> $GITHUB_ENV
        echo "$RESPONSE"        

4. Sync the apps directory to Databricks

databricks sync ./apps/ /apps --full: This command uses the Databricks CLI to synchronize the local ./apps/ directory with the /apps directory on Databricks. The --full flag indicates that a full synchronization should be performed, ensuring that the contents of the local directory are fully mirrored to the remote directory.

    - name: Databricks sync
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ env.DATABRICKS_TOKEN }}
      run: |
        echo "Checking out the releases branch"
        databricks -v 
        databricks sync ./apps/ /apps --full         

5. Create Databricks App if Not Exists

This step in the GitHub Actions workflow ensures that Databricks apps are created if they do not already exist. It performs the following actions:

Environment Variables :

  • DATABRICKS_HOST: The Databricks host URL, retrieved from GitHub secrets.
  • DATABRICKS_TOKEN: The Databricks access token, retrieved from the environment variables.

Loop Through Directories :

  • The script iterates over each directory inside the apps directory.
  • It skips hidden directories (those starting with a dot).

Create App :

  • For each app directory, it attempts to create a new Databricks app using the Databricks API.
  • If the app already exists, it skips the creation process.
  • It also grants the app access to the SQL warehouse, with the corresponding permission for the app's service principal.

Check App Status :

  • If the app is newly created, the script waits until the app’s status becomes ACTIVE.
  • It checks the app status every 30 seconds.

Deploy App :

  • Once the app is ACTIVE, it deploys the app using the databricks apps deploy command.

    - name: Create Databricks app if not exists
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ env.DATABRICKS_TOKEN }}
        SQL_WAREHOUSE: ${{ secrets.SQL_WAREHOUSE }}
        
      run: |

        for app_dir in $(find apps -mindepth 1 -maxdepth 1 -type d); do
          app_name=$(basename "$app_dir")
          if [[ $app_name == .* ]]; then
            echo "Skipping hidden directory: $app_name"
            continue
          fi
          echo "Processing app: $app_name"

          RESPONSE=$(curl --request POST \
          --url "${{ secrets.DATABRICKS_HOST }}/api/2.0/apps" \
          --header "Authorization: Bearer $DATABRICKS_TOKEN" \
          --header "Content-Type: application/json" \
          --data "{\"name\": \"$app_name\", \"default_source_code_path\": \"/apps/$app_name\", \"resources\": [{\"sql_warehouse\": {\"id\": \"$SQL_WAREHOUSE\", \"permission\": \"CAN_MANAGE\"}}]}")

        
          echo "response=$RESPONSE" >> $GITHUB_ENV
          echo "Create app $RESPONSE"

          if echo "$RESPONSE" | grep -q "ALREADY_EXISTS"; then
            echo "App $app_name already exists. Skipping creation."
          else
            # Wait until the app is in the ACTIVE state
            while true; do
              APP_STATUS=$(curl --request GET \
                --url "${{ secrets.DATABRICKS_HOST }}/api/2.0/apps/$app_name" \
                --header "Authorization: Bearer $DATABRICKS_TOKEN" | jq -r .compute_status.state)
              
              if [[ "$APP_STATUS" == "ACTIVE" ]]; then
                echo "App $app_name is ACTIVE"
                break
              fi
              
              echo "Waiting for app $app_name to become ACTIVE. Current state: $APP_STATUS"
              sleep 30
            done
          fi
          # This makes sure the app is updated with the newly synced code
          databricks apps deploy $app_name --source-code-path /Workspace/apps/$app_name

        done

Things to do for production

Access management

When you create an app, it needs access to a SQL endpoint for accessing data.

The service principals that are automatically created during deployment should be added to a relevant group; this is how they get access to data.

There is also an option to give access to everyone in the organisation.
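As a sketch, such group grants could also be automated through the SQL statement execution API, in the same style as the other API calls in this pipeline. The group name data-app-readers and the catalog name app_dev are assumptions:

```shell
# Grant a group read access via the SQL statement execution API.
# The group and catalog names are hypothetical examples.
curl --request POST \
  --url "$DATABRICKS_HOST/api/2.0/sql/statements" \
  --header "Authorization: Bearer $DATABRICKS_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{
    "warehouse_id": "'"$SQL_WAREHOUSE"'",
    "statement": "GRANT USE CATALOG ON CATALOG app_dev TO `data-app-readers`"
  }'
```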

Example apps

The example apps are all essentially the same. The Streamlit apps show how to use editable dataframes.

NB! To make the app work, you need to add the cluster ID and also create a catalog app_dev, a schema default and a table people. Also grant the app's service principal access to the underlying data. For some reason the variables don't work correctly.
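This one-off setup could be scripted the same way as the other API calls in this post; a sketch, where the column definitions of the people table are purely illustrative assumptions:

```shell
# Create the catalog, schema and table the example apps expect.
# Column definitions for the people table are illustrative only.
for stmt in \
  "CREATE CATALOG IF NOT EXISTS app_dev" \
  "CREATE SCHEMA IF NOT EXISTS app_dev.default" \
  "CREATE TABLE IF NOT EXISTS app_dev.default.people (id BIGINT, name STRING)"; do
  curl --request POST \
    --url "$DATABRICKS_HOST/api/2.0/sql/statements" \
    --header "Authorization: Bearer $DATABRICKS_TOKEN" \
    --header "Content-Type: application/json" \
    --data "{\"warehouse_id\": \"$SQL_WAREHOUSE\", \"statement\": \"$stmt\"}"
  echo
done
```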

Conclusions

The apps and the APIs are still in preview, so they can give you some grey hairs. I think having a CI/CD pipeline ready while developing apps makes the development process much easier, both from a testing and a quality point of view. Access management will require a lot of work if you have many workspaces and networking restrictions between different catalogs.