Deploying Trino with Apache Ranger and Superset on Kubernetes

Quân Nguyễn
15 min read · Jul 15, 2024


Introduction

Trino, a high-performance distributed SQL query engine, offers unparalleled flexibility in accessing and analyzing data from various sources. However, as data volumes grow and become increasingly sensitive, ensuring security and proper authorization becomes paramount. Apache Ranger, a comprehensive access control framework, provides a robust solution to these challenges.

This tutorial will guide you through deploying Trino, Apache Ranger, and Apache Superset on Kubernetes, configuring their integration, and establishing effective access control policies.

Understanding the components

Trino — Ranger — Superset architecture

Trino: A distributed SQL query engine that can query data from various sources. A Trino cluster consists of a coordinator and many workers.

Apache Ranger: An open-source framework for centralized policy management that we will use for Trino access control.

  • Apache Ranger Admin: the core component of the Apache Ranger framework. It provides a UI where administrators define and manage access control policies, and it allows for the creation, modification, and deletion of users, groups, and roles, which is crucial for defining the subjects that interact with the data. It also collects audit logs and provides a centralized place to browse through them.
  • Ranger Plugin: installed on the Trino coordinator; it keeps the policies on the Trino cluster in sync with those defined in Ranger Admin.
  • Opensearch: used as audit-log storage for Apache Ranger.
  • Postgres: used as the database for storing policies, users, groups, roles, etc., for Apache Ranger.

Apache Superset: provides a rich and intuitive interface for users to explore and visualize data. In this tutorial, we will connect Superset to Trino and query data in SQL Lab.

By combining Trino, Apache Ranger, and Apache Superset, we can create a robust data platform that empowers users to explore data securely and efficiently.

Pre-requisites

  • Kubernetes: to follow this tutorial, you must have a Kubernetes environment set up (EKS, GKE, K3s, etc.). For simplicity, we will deploy the stack using K3s on a single machine; you can follow the official documentation from the K3s website here to install a Kubernetes cluster. By default, K3s is bundled with the Traefik ingress controller, which we will use to expose the cluster to users. If you choose EKS or GKE as the deployment environment, the ingress configuration will differ from the one in this tutorial.
  • kubectl & Helm CLI: make sure you have Helm CLI and kubectl installed on your machine.
  • K9s (optional): a powerful, terminal-based user interface designed to simplify the management and observation of Kubernetes clusters.
  • A Cloudflare account and a registered domain with an SSL certificate.

Resources

You can find all the custom values files and the Dockerfiles for the custom images used to deploy this stack in this repository. Now let’s start our journey!

Deploying components

Step 0: Creating a new namespace on Kubernetes

We will start our journey by creating a separate namespace in our Kubernetes cluster for deploying all components in our stack.

kubectl create namespace trino-cluster
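Optionally, you can make this namespace the default for your current kubectl context, so that you can omit -n trino-cluster from subsequent commands (the commands in this tutorial keep the flag for clarity):

# Set trino-cluster as the default namespace for the current context
kubectl config set-context --current --namespace=trino-cluster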

Step 1: Deploying Postgres

Next, we need to deploy Postgres as a database for Apache Ranger and Superset. Bear in mind that, to keep this tutorial simple, we will use a single standalone Postgres instance hosted on Kubernetes for both Apache Ranger and Apache Superset. In addition, I will skip the TLS configuration and specify some credentials directly in the values file when setting up our Postgres instance. If you want to run this stack in production, please revisit all of these choices.

We will use the Bitnami Helm Chart to deploy our Postgres instance; you can find more information about this Helm Chart here.

image:
  registry: docker.io
  repository: bitnami/postgresql
  tag: "16.3.0"
  pullPolicy: IfNotPresent
auth:
  enablePostgresUser: true
  database: "superset" # Create a new database used by Superset later on
  username: "superset" # Create a new user used by Superset later on
  password: "super-secret-password-superset-admin" # Change password used by "superset" user
  postgresPassword: "super-secret-password-postgres" # Change password used by "postgres" user
architecture: standalone
containerPorts:
  postgresql: 5432
postgresqlDataDir: /bitnami/postgresql/data
primary:
  name: primary
  resourcesPreset: "large"
  service:
    type: ClusterIP
    ports:
      postgresql: 5432
  persistence:
    enabled: true
  persistentVolumeClaimRetentionPolicy:
    enabled: true
    whenScaled: Retain
    whenDeleted: Retain

You can find this file at the deployment/postgres/values.yaml path. All you need to change is the auth section, which contains the credentials used to authenticate to Postgres. This Postgres database doesn’t need to be exposed to the outside world, so we only need to configure a ClusterIP service for it.

Once the above changes are made to the values.yaml file, we can install Postgres in our Kubernetes environment.

helm install postgres-trino oci://registry-1.docker.io/bitnamicharts/postgresql -f deployment/postgres/values.yaml -n trino-cluster
Deploying Postgres
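To verify that the database is up before moving on, you can check the pod status and open an interactive psql session. The pod name below follows the Bitnami chart’s naming for this release (a StatefulSet pod named postgres-trino-postgresql-0); double-check it with kubectl get pods in your cluster.

# List the pods in our namespace and confirm the Postgres pod is Running
kubectl get pods -n trino-cluster
# Open a psql session as the postgres superuser (you will be prompted for the password)
kubectl exec -it postgres-trino-postgresql-0 -n trino-cluster -- psql -U postgres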

Step 2: Deploying Opensearch

Next, we will deploy an Opensearch instance to store audit logs for Apache Ranger; its values file is located at deployment/opensearch/values.yaml. Make sure the Opensearch Helm repository is registered (see the commands below), then execute the install command.
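If the Opensearch chart repository is not yet registered with Helm, add it first. The URL below is the official Opensearch Helm charts repository; skip this step if you have already added it.

# Adding Opensearch repository to Helm
helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm repo update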

helm install opensearch-trino opensearch/opensearch -f deployment/opensearch/values.yaml -n trino-cluster
Opensearch has been successfully deployed
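To confirm Opensearch is healthy, you can port-forward its service and query the cluster health endpoint. The service name below is the chart default (it matches the headless name opensearch-cluster-master-headless used later in this tutorial); verify it with kubectl get svc.

# Forward the Opensearch HTTP port to localhost
kubectl port-forward svc/opensearch-cluster-master 9200:9200 -n trino-cluster
# In another terminal, check the cluster health (expect a green or yellow status)
curl -s http://localhost:9200/_cluster/health?pretty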

Step 3: Deploying Ranger Admin

With Postgres and Opensearch successfully deployed, we can now deploy the Ranger Admin component. Apache Ranger doesn’t have an official Helm Chart for Kubernetes, so I created one for this purpose. You can find more information about the Helm Chart here.

You can use my pre-built Docker image for Ranger Admin or re-build the image by yourself using the Dockerfile in docker-images/ranger/ folder. You can find my pre-built image here.

rangerAdmin:
  image:
    repository: qnguyen3496/ranger-admin
    tag: 3.0.0-SNAPSHOT
    pullPolicy: IfNotPresent
  installPropertiesSecret: ranger-admin-secret # Must be created separately (see below)
  resources:
    requests:
      cpu: 1
      memory: 2Gi
    limits:
      cpu: 4
      memory: 6Gi
service:
  rangerAdmin:
    type: ClusterIP
    port: 6080
ingress:
  rangerAdmin:
    enabled: true
    className: traefik
    annotations:
      traefik.ingress.kubernetes.io/router.entrypoints: web
    hosts:
      - host: ranger.example.com # Change this to your domain name for the ranger-admin service
        paths:
          - path: /
            pathType: Prefix

You can find this file at the deployment/ranger-admin/values.yaml path. Since we need to access the Ranger Admin UI from a web browser, we expose it using a Traefik ingress. Please change the host value to an appropriate value; in this tutorial, I used cloudflared from Cloudflare to expose the service to the outside world.

Besides this values file, we need to create a Kubernetes secret that stores the configuration for Ranger Admin; this secret will be created from the deployment/ranger-admin/config/install.properties file. There is some information in this file that you need to change.

...
db_root_user=postgres # Root user of the Postgres database
db_root_password=super-secret-password-postgres # Root password of the Postgres database
db_host=postgres-trino-postgresql-hl # Hostname of the Postgres database (use Postgres headless service name in this tutorial)
...
db_name=ranger # New database will be created on Postgres using root user credential
db_user=rangeradmin # New user will be created on Postgres using root user credential
db_password=super-secret-password-postgres-for-ranger # Password for new user
...
rangerAdmin_password=superSecretPassword1! # Password for admin user of Ranger Admin
rangerTagsync_password=superSecretPassword2! # Password for Tagsync component (not used in this tutorial)
rangerUsersync_password=superSecretPassword3! # Password for Usersync component (not used in this tutorial)
keyadmin_password=superSecretPassword4! # Password for key admin
...
audit_store=elasticsearch # Both Elasticsearch and Opensearch use this configuration
audit_elasticsearch_urls=opensearch-cluster-master-headless # Hostname of the Opensearch instance
audit_elasticsearch_port=9200
audit_elasticsearch_protocol=http
audit_elasticsearch_user=admin # Username of the Opensearch instance
audit_elasticsearch_password= # Password of the Opensearch instance (leave it blank since we disable the security plugin)
audit_elasticsearch_index=ranger_audits
audit_elasticsearch_bootstrap_enabled=true

After changing the necessary configuration in that file, we need to create a Kubernetes secret from it.

kubectl create secret generic ranger-admin-secret --from-file=deployment/ranger-admin/config/install.properties -n trino-cluster
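You can confirm that the secret exists and contains the file under a single install.properties key:

# Inspect the secret's keys and sizes (values are not printed)
kubectl describe secret ranger-admin-secret -n trino-cluster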

Note: the name of the secret must be identical to the installPropertiesSecret in the values file above.

To deploy the Ranger Admin component, execute the below command.

helm install ranger-admin https://github.com/vinhhap/apache-ranger-chart/releases/download/apache-ranger-0.0.1/apache-ranger-0.0.1.tgz -f deployment/ranger-admin/values.yaml -n trino-cluster

You may need to wait up to 3 minutes for the Ranger Admin component to start for the first time.
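While waiting, you can watch the pods come up. If your ingress or DNS is not wired up yet, you can also port-forward the admin service to reach the UI locally; the service name below (ranger-admin-admin-svc) is the one referenced later in the Ranger plugin configuration, but verify it with kubectl get svc.

# Watch the Ranger Admin pod until it becomes Ready
kubectl get pods -n trino-cluster -w
# Optionally, reach the UI at http://localhost:6080 without going through the ingress
kubectl port-forward svc/ranger-admin-admin-svc 6080:6080 -n trino-cluster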

Ranger Admin has been successfully deployed

Now let’s open our web browser and head to the URL specified in the values file above.

Login screen of Ranger Admin

The default administrator user is admin; use the password specified at rangerAdmin_password in the install.properties file above to log in.

Step 4: Deploying Trino

Now let’s deploy the core component of our stack: Trino. Trino has an official Docker image which, under normal circumstances, can be used to set up a cluster on Kubernetes. However, since we need to integrate Trino with Apache Ranger, the Ranger plugin must be installed alongside the Trino coordinator. We therefore need a customized version of the Trino image; the Dockerfile used to create it can be found at docker-images/trino. You can also use my pre-built Docker image here. Although running a container as the root user is not considered good practice and can raise security concerns, be aware that the Ranger plugin requires running as root.

We will use the official Trino Helm Chart to deploy our Trino cluster; you can find more information about this Helm Chart here.

image:
  repository: qnguyen3496/trino-ranger
  tag: "452"
  pullPolicy: IfNotPresent

securityContext:
  runAsUser: 0 # The Ranger plugin requires running as the `root` user
  runAsGroup: 0 # The Ranger plugin requires running as the `root` group

server:
  workers: 1 # Number of Trino workers
  node:
    environment: dev_trino # Name of the Trino environment
  log:
    trino:
      level: INFO
  config:
    path: /tmp/trino # Temporary path for Trino configurations before being copied to `/etc/trino`
    http:
      port: 8080
    authenticationType: "PASSWORD"
    query:
      maxMemory: "4GB"
  autoscaling:
    enabled: false

additionalConfigProperties:
  - internal-communication.shared-secret=super-secret-communication-shared-secret # Shared secret to authenticate all communication between nodes of the cluster
  - http-server.process-forwarded=true # Needed when the Trino cluster is behind a load balancer or proxy server
  - hide-inaccessible-columns=true

additionalCatalogs: # Configure additional catalogs for Trino besides TPC-H and TPC-DS
  jmx: |
    connector.name=jmx
  memory: |
    connector.name=memory
    memory.max-data-per-node=128MB

envFrom: []

service:
  type: ClusterIP
  port: 8080

auth:
  passwordAuth: "admin:encrypted-password-with-htpasswd" # User/password for Trino (password must be securely hashed using bcrypt or PBKDF2)

coordinator:
  jvm:
    maxHeapSize: "8G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  config:
    memory:
      heapHeadroomPerNode: ""
    query:
      maxMemoryPerNode: "1GB"
  additionalJVMConfig:
    - -XX:+UnlockDiagnosticVMOptions
    - -XX:G1NumCollectionsKeepPinned=10000000
  resources:
    limits:
      cpu: 4
      memory: 6Gi
    requests:
      cpu: 1
      memory: 2Gi
  livenessProbe:
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6
    successThreshold: 1
  readinessProbe:
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6
    successThreshold: 1
  additionalConfigFiles:
    # Specify that this node is the coordinator
    is_coordinator.txt: |
      true
  secretMounts:
    # Ranger plugin configuration secret mounted as a volume
    - name: trino-ranger-plugin-config-secret
      secretName: trino-ranger-plugin-config-secret
      path: /tmp/ranger_plugin_config/

worker:
  jvm:
    maxHeapSize: "8G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  config:
    memory:
      heapHeadroomPerNode: ""
    query:
      maxMemoryPerNode: "1GB"
  additionalJVMConfig:
    - -XX:+UnlockDiagnosticVMOptions
    - -XX:G1NumCollectionsKeepPinned=10000000
  resources:
    limits:
      cpu: 4
      memory: 6Gi
    requests:
      cpu: 1
      memory: 2Gi
  livenessProbe:
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6
    successThreshold: 1
  readinessProbe:
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6
    successThreshold: 1
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl -v -X PUT -d '\"SHUTTING_DOWN\"' -H \"Content-type: application/json\" http://localhost:8081/v1/info/state"]
  terminationGracePeriodSeconds: 30
  additionalConfigFiles:
    # Specify that this node is not the coordinator
    is_coordinator.txt: |
      false

ingress:
  enabled: true
  className: traefik
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
  hosts:
    - host: trino.example.com # Change this to your domain name for the Trino service
      paths:
        - path: /
          pathType: Prefix
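The auth.passwordAuth value expects a user:password pair in the format Trino’s password file authenticator consumes, with the password hashed by htpasswd. As a quick sketch, assuming the htpasswd utility from the Apache httpd tools is installed locally:

# Generate a bcrypt-hashed credential for the `admin` user (bcrypt cost 10)
htpasswd -n -B -C 10 admin
# Paste the resulting `admin:$2y$...` line into auth.passwordAuth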

You can find this file at the deployment/trino/values.yaml path. Since we need to access the Trino UI from a web browser, we expose it using a Traefik ingress and place it behind Cloudflare (this setup should be used for testing purposes only; in production, replace it with a load balancer such as an AWS ELB or a GKE load balancer). Please change the host value to an appropriate value. When a load balancer or proxy server accepts a TLS-encrypted connection, it adds a forwarded HTTP header to the request, such as X-Forwarded-Proto: https. By default, the Traefik ingress controller on K3s only trusts these headers when requests come from allowed IP addresses; we can override this behavior with some extra Traefik configuration.

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    additionalArguments:
      - "--entryPoints.web.forwardedHeaders.insecure"
    ports:
      traefik:
        expose: true
    providers:
      kubernetesCRD:
        allowCrossNamespace: true

You can find this configuration file at deployment/traefik/traefik-custom-conf.yaml. To apply this configuration, execute the below command.

kubectl apply -f deployment/traefik/traefik-custom-conf.yaml

Apart from the values file, we need to make some modifications to the Ranger plugin configuration. The configuration file can be found at deployment/trino/config/ranger-plugin/install.properties.

POLICY_MGR_URL=http://ranger-admin-admin-svc:6080 # Ranger Admin URL (use headless service name in this tutorial)
REPOSITORY_NAME=dev_trino # Must match the server.node.environment value in the Trino values file (and the Service Name created later in Ranger Admin)
...
XAAUDIT.OPENSEARCH.ENABLE=true
XAAUDIT.OPENSEARCH.URL=opensearch-cluster-master-headless # Opensearch instance hostname (use headless service name in this tutorial)
XAAUDIT.OPENSEARCH.USER=admin # Opensearch username
XAAUDIT.OPENSEARCH.PASSWORD=admin # Opensearch password (can put anything in it for now)
XAAUDIT.OPENSEARCH.INDEX=ranger_audits
XAAUDIT.OPENSEARCH.PORT=9200
XAAUDIT.OPENSEARCH.PROTOCOL=http
...
CUSTOM_USER=root
CUSTOM_GROUP=root

Note: for the Ranger plugin, the audit-log settings use the OPENSEARCH configuration keys, unlike Ranger Admin, which uses the elasticsearch ones.

We then need to create a Kubernetes secret from that file by executing this command.

kubectl create secret generic trino-ranger-plugin-config-secret --from-file=deployment/trino/config/ranger-plugin/install.properties -n trino-cluster

Our Trino cluster is now ready to be deployed; we do so by executing the commands below.

# Adding Trino repository to Helm
helm repo add trino https://trinodb.github.io/charts
# Deploying Trino
helm install trino trino/trino -f deployment/trino/values.yaml -n trino-cluster
Deploying Trino

You can open your web browser, navigate to the URL you specified in the values file above, and verify that our Trino cluster is working.
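If your DNS or ingress is not ready yet, you can also port-forward the Trino service and query its server-info REST endpoint; the service name trino below follows from the Helm release name, so verify it with kubectl get svc.

# Forward the coordinator HTTP port to localhost
kubectl port-forward svc/trino 8080:8080 -n trino-cluster
# In another terminal, confirm the server reports itself as started
curl -s http://localhost:8080/v1/info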

Trino UI on web browser

Step 5: Deploying Superset

Apache Superset is the last component in our stack. Although Superset has official Docker images, these images do not have Trino-related libraries pre-installed. Therefore, we need to create a new image based on the official image to install all required libraries. You can find the Dockerfile used to create the customized Superset image at docker-images/superset or use my pre-built image.

We will use the official Superset Helm Chart to deploy our Superset cluster; you can find more information about this Helm Chart here.

image:
  repository: qnguyen3496/superset
  tag: "4.0.1"
  pullPolicy: IfNotPresent

initImage:
  repository: apache/superset
  tag: dockerize
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8088

ingress:
  enabled: true
  ingressClassName: traefik
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
  path: /
  pathType: Prefix
  hosts:
    - superset.example.com # Change this to your domain name for the Superset service

supersetNode:
  replicas:
    enabled: true
    replicaCount: 1
  autoscaling:
    enabled: false
  podDisruptionBudget:
    enabled: false
  connections:
    redis_host: '{{ .Release.Name }}-redis-headless'
    redis_port: "6379"
    redis_user: ""
    redis_cache_db: "1"
    redis_celery_db: "0"
    redis_ssl:
      enabled: false
    db_host: postgres-trino-postgresql-hl # Postgres hostname (use headless service name)
    db_port: "5432"
    db_user: superset # Postgres Superset database user
    db_pass: super-secret-password-superset-admin # Postgres Superset database password
    db_name: superset # Postgres Superset database
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 1
      memory: 1Gi

supersetWorker:
  replicas:
    enabled: true
    replicaCount: 1
  autoscaling:
    enabled: false
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 1
      memory: 1Gi

init:
  enabled: true
  loadExamples: false
  createAdmin: true
  adminUser:
    username: admin
    firstname: Superset
    lastname: Admin
    email: admin@superset.com
    password: "super-secret-password-superset-admin" # Change this password

postgresql:
  enabled: false # Disable deploying a separate Postgres database (we use the one from Step 1)

redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
  master:
    persistence:
      enabled: false

extraSecretEnv:
  SUPERSET_SECRET_KEY: 'super-secret-key-superset' # Change this key

You can find this file at the deployment/superset/values.yaml path. Since we need to access the Superset UI from a web browser, we configure a Traefik ingress for it.

Finally, we execute the below command to deploy our Superset cluster.

# Adding Superset repository to Helm
helm repo add superset https://apache.github.io/superset
# Deploying Superset
helm install superset superset/superset -f deployment/superset/values.yaml -n trino-cluster

It may take around 4–5 minutes for the Superset cluster to be up and running for the first time.
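While waiting, you can watch the init job and pods progress. Once everything is Running, you can also port-forward the Superset service if your ingress is not ready yet; the service name superset below follows from the Helm release name.

# Watch the Superset pods and the init job complete
kubectl get pods -n trino-cluster -w
# Optionally, reach the UI at http://localhost:8088 without going through the ingress
kubectl port-forward svc/superset 8088:8088 -n trino-cluster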

Superset has been successfully deployed

Open your browser and navigate to the Superset URL configured above to verify that it is working.

Superset on web browser

Integrating components

At this stage, we have finished deploying all the necessary components of our Trino-Ranger-Superset stack. However, they are not yet connected to each other. We need to do some setup, mostly through the web UIs, to finish our deployment. Let’s get to it!

Creating Ranger Admin and Trino connection

Open your browser and navigate to the Ranger Admin URL; you can log in as the admin user with the password specified in the install.properties file from Step 3 of the Deploying components section. The browser will redirect you to the Ranger Admin homepage.

Ranger Admin homepage

In the Trino section of the homepage, click on the + icon to create a new connection to our Trino cluster.

Create a connection to Trino

Now, on the Create Service screen, we need to fill out the following information:

  • Service Name: must be identical to the server.node.environment value that we specified in the Trino values file in Step 4 above.
  • Display Name: a name for the service that will be displayed later on the Ranger Admin homepage.
  • Active Status: choose Enabled.
  • Username: the username you specified in auth.passwordAuth in the Trino values file.
  • Password: the password you specified in auth.passwordAuth in the Trino values file.
  • jdbc.driverClassName: leave it at the default io.trino.jdbc.TrinoDriver.
  • jdbc.url: the format is jdbc:trino://[HOST]:[PORT]; see the example below.
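For the cluster deployed in this tutorial, a plausible value, built from the in-cluster service name and the port set in the Trino values file, would be:

jdbc:trino://trino.trino-cluster.svc.cluster.local:8080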

After filling out all the information above, do not click Test Connection immediately; if you do, you will get an error. Instead, click the Add button, and the browser will redirect you back to the Ranger Admin homepage, where you will see the service you have just created. Next, click the edit icon to go back to the Create Service page.

Successfully verify the connection between Ranger Admin and Trino

Click the Test Connection button to check whether the connection has been established successfully. You may have to wait a few minutes before clicking it, since the Ranger plugin on the Trino coordinator can take some time to sync all the default policies for the first time.

On the Trino coordinator, the downloaded policies are stored as JSON files under /etc/ranger/[SERVICE_NAME]/policycache. You can open a bash shell on the Trino coordinator and verify this.

# Go to the bash shell of the running Trino coordinator
kubectl exec -it [TRINO_COORDINATOR_POD_NAME] -n trino-cluster -- bash
# Change directory to the policy cache folder inside Trino coordinator container
cd /etc/ranger/[SERVICE_NAME]/policycache
# Print the policy file on the screen
cat trino_[SERVICE_NAME].json
Print policy file

Creating Superset and Trino connection

After successfully integrating Trino and Ranger Admin, we will next create a connection from Superset to our Trino cluster. Open a new tab in your web browser and navigate to the Superset URL. On the login page, use the credentials specified in the init.adminUser section to log in. After logging in, the browser will take you to the Superset welcome page.

Superset welcome page

Next, hover your cursor over the Settings dropdown list and choose Database Connections.

Choose Database Connections

On the Databases screen, Superset lists all the connections it currently has; since we haven’t set up any, the list will be blank. Click the + DATABASE button in the upper right corner of the page to create a new connection to our Trino cluster.

Choose Trino to create a new connection

On the popup screen, choose Trino from the dropdown list and Superset will redirect to the next screen immediately.

Create a connection to the TPC-H catalog

We will create a connection to the TPC-H catalog, a built-in catalog mainly used for benchmarking purposes. You can find more information about this catalog here. After filling in the SQLALCHEMY URI box (an example is shown below), click on the ADVANCED tab.
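As an illustration, a URI pointing at the in-cluster Trino service and the TPC-H catalog could look like the line below; the trino:// scheme is provided by the Trino SQLAlchemy dialect baked into our custom Superset image, and the hostname assumes Superset runs in the same namespace as the trino service.

trino://admin@trino:8080/tpch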

Choose Asynchronous query execution

In the Performance section, tick Asynchronous query execution to make Superset push the actual query execution to its Celery workers. Next, expand the Security section.

Fill out the information in the Security section, and remember to enable the Impersonation feature

We need to fill out the credentials used to connect to the Trino cluster. In addition, tick the Impersonate logged in user (Presto, Trino, Drill, Hive, and GSheets) box. This enables the impersonation feature for this connection, so all queries are executed under the username of the logged-in Superset user. You can find more information about the impersonation feature here.

Finally, go back to the Basic tab and hit the TEST CONNECTION button. If the test succeeds, click Connect. We now have a working connection between Superset and Trino.

A new connection between Superset and Trino has been established

You can go to SQL Lab and execute some queries to verify that the connection works as expected, such as the example below.
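For instance, the following query reads from the tiny schema that ships with the TPC-H connector:

SELECT * FROM tpch.tiny.nation LIMIT 10;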

Execute queries on SQL Lab

Conclusion

In conclusion, by following the detailed steps for configuration and deployment, you can successfully set up a robust data stack comprising Trino, Apache Ranger, and Superset on Kubernetes. This setup not only helps in managing data securely but also provides powerful analytics capabilities through Superset’s interactive UI.
