Developer Blog | dbt Developer Hub

Optimizing Materialized Views with dbt

August 3, 2023 · 11 min read

Product Manager @ dbt Labs

note

This blog post was updated on December 18, 2023 to cover the support of MVs on dbt-bigquery and updates on how to test MVs.

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. A enterprise customer I was working with, Jetblue, asked me for help running their dbt models every 2 minutes to meet a 5 minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry the cats have multiplied as well). Jetblue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?

Create dbt Documentation and Tests 10x faster with ChatGPT

July 18, 2023 · 8 min read

Pedro Brito de Sa

Product Analyst @ Sage

Whether you are creating your pipelines into dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague then? Because it’s a hassle having to write down each individual field, its description in layman terms and figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the leg work for you.

Data Vault 2.0 with dbt Cloud

July 3, 2023 · 15 min read

Rastislav Zdechovan

Analytics Engineer @ Infinite Lambda

Sean McIntyre

Senior Solutions Architect @ dbt Labs

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system detailed vigorously in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

Building a historical user segmentation model with dbt

June 13, 2023 · 14 min read

Santiago Jauregui

Data Analytics Leader @ MODO

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. The users will create tickets to ask for assistance, and the CX team will start attending them in the order that they are created. This is a good first approach, but not a data driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.

Accelerate your documentation workflow: Generate docs for whole folders at once

May 17, 2023 · 9 min read

Mikael Thorup

Senior Analytics Engineer @ Lunar

At Lunar, most of our dbt models are sourcing from event-driven architecture. As an example, we have the following models for our activity_based_interest folder in our ingestion layer:

activity_based_interest_activated.sql
activity_based_interest_deactivated.sql
activity_based_interest_updated.sql
downgrade_interest_level_for_user.sql
set_inactive_interest_rate_after_july_1st_in_bec_for_user.sql
set_inactive_interest_rate_from_july_1st_in_bec_for_user.sql
set_interest_levels_from_june_1st_in_bec_for_user.sql

This results in a lot of the same columns (e.g. account_id) existing in different models, across different layers. This means I end up:

Writing/copy-pasting the same documentation over and over again
Halfway through, realizing I could improve the wording to make it easier to understand, and go back and update the .yml files I already did
Realizing I made a syntax error in my .yml file, so I go back and fix it
Realizing the columns are defined differently with different wording being used in other folders in our dbt project
Reconsidering my choice of career and pray that a large language model will steal my job
Considering if there’s a better way to be generating documentation used across different models

Modeling ragged time-varying hierarchies

May 2, 2023 · 18 min read

Sterling Paramore

Sr Staff Data Engineer @ Mainspring Energy

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

Data engineers + dbt v1.5: Evolving the craft for scale

May 1, 2023 · 12 min read

Sung Won Chung

Solutions Architect @ dbt Labs

Kira Furuichi

Technical Writer @ dbt Labs

I, Sung, entered the data industry by chance in Fall 2014. I was using this thing called audit command language (ACL) to automate debits equal credits for accounting analytics (yes, it’s as tedious as it sounds). I remember working my butt off in a hotel room in Des Moines, Iowa where the most interesting thing there was a Panda Express. It was late in the AM. I’m thinking about 2 am. And I took a step back and thought to myself, “Why am I working so hard for something that I just don’t care about with tools that hurt more than help?”

Why we're deprecating the dbt_metrics package

April 26, 2023 · 5 min read

Callum McCann

Senior Developer Experience Advocate @ dbt Labs

Hello, my dear data people.

If you haven’t read Nick & Roxi’s blog post about what’s coming in the future of the dbt Semantic Layer, I highly recommend you read through that, as it gives helpful context around what the future holds.

With that said, it has come time for us to bid adieu to our beloved dbt_metrics package. Upon the release of dbt-core v1.6 in late July, we will be deprecating support for the dbt_metrics package.

How we reduced a 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake

April 25, 2023 · 12 min read

Arthur Marcon

Analytics Engineer @ Indicium Tech

Lucas Bergo Dias

Analytics Engineer @ Indicium Tech

Christian van Bellen

Analytics Engineer @ Indicium Tech

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx may have difficulties to cope with the complexity increase within an organization’s data pipeline, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. Also, this transition involved migrating from on-premises SQL Server to Snowflake cloud computing. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client's 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

Building a Kimball dimensional model with dbt

April 20, 2023 · 20 min read

Jonathan Neo

Data Engineer @ Canva & Data Engineer Camp

Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT) to name a few.

Data modeling techniques on a normalization vs denormalization scale

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling technique for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

Understand dimensional modeling concepts
Set up a mock dbt project and database
Identify the business process to model
Identify the fact and dimension tables
Create the dimension tables
Create the fact table
Document the dimensional model relationships
Consume the dimensional model

dbt Squared: Leveraging dbt Core and dbt Cloud together at scale

April 17, 2023 · 12 min read

João Antunes

Lead Data Engineer, Global Product Strategy @ Roche

Yannick Misteli

Head of Engineering, Global Product Strategy @ Roche

Sean McIntyre

Senior Solutions Architect @ dbt Labs

Teams thrive when each team member is provided with the tools that best complement and enhance their skills. You wouldn’t hand Cristiano Ronaldo a tennis racket and expect a perfect serve! At Roche, getting the right tools in the hands of our teammates was critical to our ability to grow our data team from 10 core engineers to over 100 contributors in just two years. We embraced both dbt Core and dbt Cloud at Roche (a dbt-squared solution, if you will!) to quickly scale our data platform.

The missing guide to debug() in dbt

March 29, 2023 · 7 min read

Benoit Perigaud

Staff Analytics Engineer @ dbt Labs

Editor's note—this post assumes intermediate knowledge of Jinja and macros development in dbt. For an introduction to Jinja in dbt check out the documentation and the free self-serve course on Jinja, Macros, Packages.

Jinja brings a lot of power to dbt, allowing us to use ref(), source() , conditional code, and macros. But, while Jinja brings flexibility, it also brings complexity, and like many times with code, things can run in expected ways.

The debug() macro in dbt is a great tool to have in the toolkit for someone writing a lot of Jinja code, but it might be difficult to understand how to use it and what benefits it brings.

Let’s dive into the last time I used debug() and how it helped me solve bugs in my code.

Audit_helper in dbt: Bringing data auditing to a higher level

March 24, 2023 · 15 min read

Arthur Marcon

Analytics Engineer @ Indicium Tech

Lucas Bergo Dias

Analytics Engineer @ Indicium Tech

Christian van Bellen

Analytics Engineer @ Indicium Tech

Auditing tables is a major part of analytics engineers’ daily tasks, especially when refactoring tables that were built using SQL Stored Procedures or Alteryx Workflows. In this article, we present how the audit_helper package can (as the name suggests) help the table auditing process to make sure a refactored model provides (pretty much) the same output as the original one, based on our experience using this package to support our clients at Indicium Tech®.

BigQuery ingestion-time partitioning and partition copy with dbt

March 10, 2023 · 8 min read

Christophe Oudar

Staff Software Engineer @ Teads

At Teads, we’ve been using BigQuery (BQ) to build our analytics stack since 2017. As presented in a previous article, we have designed pipelines that use multiple roll-ups that are aggregated in data marts. Most of them revolve around time series, and therefore time-based partitioning is often the most appropriate approach.

Tips and advice to study for, and pass, the dbt Certification exam

February 16, 2023 · 9 min read

Callie White

Analytics Consultant @ Montreal Analytics

Jade Milaney

Analytics Consultant @ Montreal Analytics

The new dbt Certification Program has been created by dbt Labs to codify the data development best practices that enable safe, confident, and impactful use of dbt. Taking the Certification allows dbt users to get recognized for the skills they’ve honed, and stand out to organizations seeking dbt expertise.

Over the last few months, Montreal Analytics, a full-stack data consultancy servicing organizations across North America, has had over 25 dbt Analytics Engineers become certified, earning them the 2022 dbt Platinum Certification award.

In this article, two Montreal Analytics consultants, Jade and Callie, discuss their experience in taking, and passing, the dbt Certification exam to help guide others looking to study for, and pass the exam.

How we cut our tests by 80% while increasing data quality: the power of aggregating test failures in dbt

January 24, 2023 · 8 min read

Noah Kennedy

Software Developer @ Tempus

Testing the quality of data in your warehouse is an important aspect in any mature data pipeline. One of the biggest blockers for developing a successful data quality pipeline is aggregating test failures and successes in an informational and actionable way. However, ensuring actionability can be challenging. If ignored, test failures can clog up a pipeline and create unactionable noise, rendering your testing infrastructure ineffective.

Power up your data quality with grouped checks

January 17, 2023 · 7 min read

Emily Riederer

Senior Manager Analytics @ Capital One

Imagine you were responsible for monitoring the safety of a subway system. Where would you begin? Most likely, you'd start by thinking about the key risks like collision or derailment, contemplate what causal factors like scheduling software and track conditions might contribute to bad outcomes, and institute processes and metrics to detect if those situations arose. What you wouldn't do is blindly apply irrelevant industry standards like testing for problems with the landing gear (great for planes, irrelevant for trains) or obsessively worry about low probability events like accidental teleportation before you'd locked down the fundamentals.

When thinking about real-world scenarios, we're naturally inclined to think about key risks and mechanistic causes. However, in the more abstract world of data, many of our data tests often gravitate towards one of two extremes: applying rote out-of-the-box tests (nulls, PK-FK relationships, etc.) from the world of traditional database management or playing with exciting new toys that promise to catch our wildest errors with anomaly detection and artificial intelligence.

Between these two extremes lies a gap where human intelligence goes. Analytics engineers can create more effective tests by embedding their understanding of how the data was created, and especially how this data can go awry (a topic I've written about previously). While such expressive tests will be unique to our domain, modest tweaks to our mindset can help us implement them with our standard tools. This post demonstrates how the simple act of conducting tests by group can expand the universe of possible tests, boost the sensitivity of the existing suite, and help keep our data "on track". This feature is now available in dbt-utils.

Making the leap from accountant to analytics engineer

December 15, 2022 · 9 min read

Samuel Harting

Associate Analytics Engineer @ dbt Labs

In seventh grade, I decided it was time to pick a realistic career to work toward, and since I had an accountant in my life who I really looked up to, that is what I chose. Around ten years later, I finished my accounting degree with a minor in business information systems (a fancy way of saying I coded in C# for four or five classes). I passed my CPA exams quickly and became a CPA as soon as I hit the two-year experience requirement. I spent my first few years at a small firm completing tax returns but I didn't feel like I was learning enough, so I went to a larger firm right before the pandemic started. The factors that brought me to the point of changing industries are numerous, but I’ll try to keep it concise: the tax industry relies on underpaying its workers to maintain margins and prevent itself from being top-heavy, my future work as a manager was unappealing to me, and my work was headed in a direction I wasn’t excited about.

Introducing the dbt_project_evaluator: Automatically evaluate your dbt project for alignment with best practices

November 30, 2022 · 7 min read

Grace Goheen

Analytics Engineer @ dbt Labs

Why we built this: A brief history of the dbt Labs Professional Services team

If you attended Coalesce 2022, you’ll know that the secret is out — the dbt Labs Professional Services team is not just a group of experienced data consultants; we’re also an intergalactic group of aliens traveling the Milky Way on a mission to enable analytics engineers to successfully adopt and manage dbt throughout the galaxy.

How to move data from spreadsheets into your data warehouse

November 23, 2022 · 11 min read

Joel Labes

Senior Developer Experience Advocate @ dbt Labs

Once your data warehouse is built out, the vast majority of your data will have come from other SaaS tools, internal databases, or customer data platforms (CDPs). But there’s another unsung hero of the analytics engineering toolkit: the humble spreadsheet.

Spreadsheets are the Swiss army knife of data processing. They can add extra context to otherwise inscrutable application identifiers, be the only source of truth for bespoke processes from other divisions of the business, or act as the translation layer between two otherwise incompatible tools.

Because of spreadsheets’ importance as the glue between many business processes, there are different tools to load them into your data warehouse and each one has its own pros and cons, depending on your specific use case.

Developer Blog | dbt Developer Hub

Introduction​

Introduction​

Why we built this: A brief history of the dbt Labs Professional Services team​

Introduction

Introduction

Why we built this: A brief history of the dbt Labs Professional Services team