Module 4: Analytics Engineering (DTC DE Zoomcamp Week 4)
dbt
(Data Build Tool) Overview
What
is dbt?
dbt
stands for data build tool. It's a transformation tool that allows us to
transform process raw data in our Data Warehouse
to transformed data which can be later used by Business Intelligence tools and
any other data consumers.
dbt also allows us to introduce good
software engineering practices by defining a deployment workflow:
1.
Develop models
2. Test
and document models
3.
Deploy models with version control and CI/CD.
How
does dbt work?
dbt
works by defining a modeling layer that stands on top of our DataWarehouse.
Each table is turned into a model and then
transformed into a derived model, that can be stored into the DataWarehouse for
persistence.
A
model consists in:
• *.sql
file
• Select
statement, no DDL or DML are used
• File
that dbt will compile and run in our DataWarehouse
How
to use dbt?
dbt
has 2 main components: dbt Core and dbt Cloud with the following
characteristics:
dbt Cloud - SaaS application to develop and
manage dbt projects
•
Web-based IDE to develop, run and test a dbt project.
• Jobs
orchestration.
•
Logging and alerting.
•
Intregrated documentation.
• Free
for individuals (one developer seat).
dbt Core - open-source project that allows
the data transformation
• Builds
and runs a dbt project (.sql and .yaml files).
•
Includes SQL compilation logic, macros and database adapters.
•
Includes a CLI interface to run dbt commands locally.
•
Open-source and free to use.
For
this project, I use :
• dbt
Cloud (dbt Cloud IDE) + GCP BigQuery
• dbt
Core on Docker + BigQuery
Developing
with dbt
Anatomy
of a dbt model
dbt
models are a combination of SQL (using SELECT statements) and Jinja templating language to define templates.
Below
is an example of abt model:
{{
config(materialized='table')
}}
SELECT *
FROM
staging.source_table
WHERE
record_state = 'ACTIVE'
·
In
the Jinja statement defined within {{ }} block we call the config function.
More information about Jinja and how
to use it for dbt in this link.
·
The
config function is commonly used at the beginning of a model to define a
materialization strategy: a strategy for
persisting dbt models in a data warehouse
·
There
are 4 materialization strategies with the following characteristics:
·
View:
When using the view materialization, the model is rebuilt as a view on each
run, via a create view as statement
• Table:
The model is rebuilt as a table on each run, via a create table as statement
·
Incremental:
Allow dbt to insert or update records into a table since the last time that dbt
as run Ephemeral: Are
not directly build into the database. Instead, dbt will interpolate the code
from this model into dependent
models as a common table expression (CTE)
The
FROM clause of a dbt model
The
FROM clause within a SELECT statement defines the sources of the data to be
used.
The
following sources are available to dbt models:
Sources : The data loaded within our Data
Warehouse.
• We can
access this data with the source() function.
·
The
sources key in ourYAML file contains the details of the databases that the
source() function can access and translate
into proper SQL-valid names.
·
Additionally,
we can define "source freshness" to each source so that we can check
whether a source is "fresh" or
"stale",
which can be useful to check whether our data pipelines are working properly.
·
More
info about sources in [this link}(https://docs.getdbt.com/docs/building-a-dbt-project/using-sources).
Seeds : CSV files which can be stored in
our repo under the seeds folder.
• The
repo gives us version controlling along with all of its benefits.
• Seeds
are best suited to static data which changes infrequently.
• Seed
usage:
◦ Add a
CSV file to our seeds folder.
◦ Run
the dbt seed command to create a table in our Data
Warehouse.
•
|
If
we update the content of a seed, running dbt seed will append the updated
values to the table rather than substituing
them. Running dbt seed --full-refresh instead will drop the old table and
create a new one. |
|
◦
Refer to the seed in our model with the ref() function. |
◦ More
info about seeds in this link.
◦
Here's
an example of how we would declare a source in a .yml file:
sources:
- name: staging
database: production
schema: trips_data_all
loaded_at_field: record_loaded_at
tables:
- name: green_tripdata
- name: yellow_tripdata
freshness:
error_after: {count: 6, period: hour}
And
here's how we would reference a source in a FROM clause:
FROM {{
source('staging','yellow_tripdata') }}
• The
first argument of the source() function is the source name, and the second is
the table name.
In
the case of seeds, assuming we've got a taxi_zone_lookup.csv file in our seeds
folder which contains locationid, borough,
zone and service_zone:
SELECT
locationid,
borough,
zone,
replace(service_zone, 'Boro', 'Green') as service_zone
FROM {{
ref('taxi_zone_lookup) }}
The
ref() function references underlying tables and views in the DataWarehouse.
When compiled, it will automatically build
the dependencies and resolve the correct schema fo us. So, if BigQuery contains
a schema/dataset called dbt_dev inside
the my_project database which we're using for development and it contains a
table called stg_green_tripdata, then the
following code...
WITH
green_data AS (
SELECT *,
'Green' AS service_type
FROM {{ ref('stg_green_tripdata') }}
),
...will
compile to this:
WITH
green_data AS (
SELECT *,
'Green' AS service_type
FROM
"my_project"."dbt_dev"."stg_green_tripdata"
),
·
The
ref() function translates our references table into the full reference, using
the database.schema.table structure.
·
If
we were to run this code in our production environment, dbt would automatically
resolve the reference to make ir
point to our production schema.
Defining
a source and creating a model
It's
time to create our first model.
We
will begin by creating 2 new folders under our models folder:
•
staging will have the raw models.
• core
will have the models that we will expose at the end to the BI tool,
stakeholders, etc.
Under
staging we will add 2 new files: sgt_green_tripdata.sql and schema.yml:
# schema.yml
version: 2
sources:
- name: staging
database: our_project
schema: trips_data_all
tables:
- name: green_tripdata
- name: yellow_tripdata
• We
define our sources in the schema.yml model properties file.
• We are
defining the 2 tables for yellow and green taxi data as our sources.
--
sgt_green_tripdata.sql
{{
config(materialized='view') }}
select *
from {{ source('staging', 'green_tripdata') }} limit
100
• This
query will create a view in the staging dataset/schema in our database.
• We make
use of the source() function to access the green taxi data table, which is
defined inside the schema.yml
file.
The
advantage of having the properties in a separate file is that we can easily
modify the schema.yml file to change the database
details and write to different databases without having to modify our
sgt_green_tripdata.sql file.
we
may know run the model with the dbt run command, either locally or from dbt
Cloud.
Macros
Macros
are pieces of code in Jinja that can be reused, similar to functions in other
languages.
dbt
already includes a series of macros like config(), source() and ref(), but
custom macros can also be defined.
Macros
allow us to add features to SQL that aren't otherwise available, such as:
• Use
control structures such as if statements or for loops.
• Use
environment variables in our dbt project for production.
• Operate
on the results of one query to generate another query.
•
Abstract snippets of SQL into reusable macros.
Macros
are defined in separate .sql files which are typically stored in a macros
directory.
There
are 3 kinds of Jinja delimiters:
• {% ...
%} for statements (control blocks, macro definitions)
• {{ ...
}} for expressions (literals, math, comparisons, logic, macro calls...)
• {# ...
#} for comments.
•
Here's
a macro definition example:
{# This
macro returns the description of the payment_type #}
{% macro
get_payment_type_description(payment_type) %}
case {{ payment_type }}
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end
{% endmacro
%}
·
The
macro keyword states that the line is a macro definition. It includes the name
of the macro as well as the parameters.
·
The
code of the macro itself goes between 2 statement delimiters. The second
statement delimiter contains an endmacro
keyword.
·
In
the code, we can access the macro parameters using expression delimiters.
·
The
macro returns the code we've defined rather than a specific value.
Here's
how we use the macro:
select
{{ get_payment_type_description('payment-type') }} as
payment_type_description,
congestion_surcharge::double precision
from {{
source('staging','green_tripdata') }}
where
vendorid is not null
• We pass
a payment-type variable which may be an integer from 1 to 6.
And
this is what it would compile to:
select
case payment_type
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end as payment_type_description,
congestion_surcharge::double precision
from {{
source('staging','green_tripdata') }}
where
vendorid is not null
•
|
The macro is replaced by the code
contained within the macro definition as well as any variables that we may
have passed to the
macro parameters.
|
Packages
Macros
can be exported to packages,
similarly to how classes and functions can be exported to libraries in other
languages. Packages contain
standalone dbt projects with models and macros that tackle a specific problem
area.
When
we add a package to our project, the package's models and macros become part of
our own project.A list of useful packages
can be found in the dbt package hub.
To
use a package, we must first create a packages.yml file in the root of our work
directory. Here's an example:
packages:
-
package: dbt-labs/dbt_utils
version: 0.8.0
After
declaring our packages, we need to install them by running the dbt deps command
either locally or on dbt Cloud.
we
may access macros inside a package in a similar way to how Python access class
methods:
select
{{ dbt_utils.surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as
tripid, cast(vendorid as integer) as
vendorid,
-- ...
• The
surrogate_key() macro generates a hashed surrogate key with the specified
fields in the arguments.
Variables
Like
most other programming languages, variables
can be defined and used across our project.Variables can be defined in 2 different ways:
• Under
the vars keyword inside dbt_project.yml.
vars:
payment_type_values: [1, 2, 3, 4, 5, 6]
• As
arguments when building or running our project.
dbt build
--m --vars 'is_test_run: false'
Variables
can be used with the var() macro. For example:
{% if
var('is_test_run', default=true) %}
limit 100
{% endif %}
·
In
this example, the default value for is_test_run is true; in the absence of a
variable definition either on the dbt_project.yml
file or when running the project, then is_test_run would be true.
·
Since
we passed the value false when runnning dbt build, then the if statement would
evaluate to false and the code
within would not run.
Setting
up dbt Cloud IDE + GCP BigQuery
We
will need to create a dbt cloud using this link and
connect to our Data Warehouse following these instructions. More detailed instructions available in this guide
My dbt Cloud IDE +
BigQuery setup and testing
Setting
up dbt on Docker + GCP BigQuery
Comments
Post a Comment