How TNS Migrated to Google Analytics 4
In 2022, The New Stack undertook an ambitious project to replace its homegrown data analytics platform with a packaged solution from Google Cloud. The old analytics platform provided insights to TNS management as well as information to our sponsors through Google Data Studio reports. The application was powered by Google’s Big Query data warehouse.
Converting from a complex homegrown solution to Google Looker was an effort that took most of 2022 to accomplish. The project was done through Google Cloud’s Start-up Program and with its partner Bytecode. Key data sources for the new analytics platform included WordPress, Salesforce, YouTube, Simplecast, Twitter and Google Analytics.
The data modeling work was done within dbt and LookML, with Fivetran connecting the data sources to BigQuery. Looker provides the dashboard visualization and gives our sponsors real-time access to their content performance data.
At the end of 2022, the new analytics platform had completely replaced the old one and everyone was getting accustomed to viewing their reports in the new Looker dashboards.
Planning for Google Analytics 4 Migration
In early 2022, Google communicated its intentions to stop processing data in Universal Analytics (UA) properties by July 1, 2023, in favor of its successor Google Analytics 4 (GA4). This date seemed a long way off while the migration to the new platform was underway. However, the move to GA4 was always something that TNS kept in mind while the new system was being implemented.
We started to collect data in our GA4 property in January 2022 in anticipation of one day cutting over to Google Analytics 4. Setting up the GA4 property and running it in parallel with UA gave us a chance to configure our custom events and compare the reports between the two systems.
The UIs between the two systems were very different. Most of the users within TNS who had mastered UA were confused at first by the new user interface. Also, the metrics calculated by the platforms were similar but different.
For TNS, pageviews and average time on page are key measures. In GA4, pageview counts were similar to UA but not exactly the same.
Other GA4 metrics differed from relied-upon UA counterparts: Average Time on Page was removed in favor of a new metric, Engagement Time.
At this point, our data warehouse was still being fed by UA through a connector from Fivetran. Our GA4 property was simply in data collection mode but not connected to the data warehouse.
Finding the Right Analytics Partner for GA4
In late 2022, The New Stack contracted with Brooklyn Data Co. to help convert our data warehouse from using UA and instead replace it with a GA4 feed. After talking with several vendors, TNS chose to work with Brooklyn Data Co. as they had experience performing GA4 migrations with other clients and were confident they could migrate our platform to the new model.
The data models developed for UA all needed to be rewritten for Google Analytics 4 as GA4 is a completely different event-driven architecture from UA. Complex logic had been written in dbt and Looker’s LookML to take the raw data provided by Fivetran’s UA connector and transform it into a format usable by Looker dashboards.
The data that is presented in Google’s UA UI is actually refined by complex logic under the hood. Getting the Looker dashboards to match what was shown in the UA interface required extensive development work. We would quickly learn that the same is true for GA4 data.
Doing the GA4 Conversion
To export the most timely data from our GA4 property, we chose to use the Google Analytics direct link to BigQuery. We configured both the streaming and daily exports to BigQuery from GA4. This is one of the nice advantages of working with the suite of products under the Google umbrella — direct integrations that do not cost anything.
Once the data arrives in BigQuery it is ready to be transformed by dbt. Brooklyn Data Company wrote a series of dbt models that are run to extract the data from source tables, and then model it in staging, intermediate and mart tables in BigQuery. Fivetran connectors load data from other data sources like Simplecast, Facebook, Youtube, WordPress and LinkedIn. These datasets are also processed by the dbt models in steps, and regenerated every three hours by dbt Cloud jobs.
With the July 1 cutoff date for UA in mind, we prioritized all the dashboards that were critical to The New Stack and worked to convert them over to the new GA4 data model first. Brooklyn Data Company performed the core of data modeling work during their engagement with The New Stack. The remaining work was done by our Bytecode analytics partner. That included some data modeling at the dbt layer as well as development within LookML for the Looker dashboards.
Transforming the GA4 Event Data
GA4 uses an event-based data model, which means that all user interactions are tracked as events. This provides a more flexible and comprehensive way to measure user behavior, as it is not limited by the traditional session-based data model used in UA.
Here are some of the key technical differences between GA4 and UA:
- Data model: GA4 uses an event-based data model, while UA uses a session-based data model. This means that all user interactions are tracked as events in GA4, while UA groups user interactions as sessions.
- Measurement: GA4 offers enhanced measurement capabilities, such as cross-device tracking and predictive analytics. UA does not offer these capabilities.
- Machine learning: GA4 is built with AI-powered solutions that fill in the gaps in conversion data by utilizing behavioral and conversion modeling. UA does not have these capabilities.
- Data privacy: GA4 offers features such as anonymized IP addresses and cookieless tracking. UA does not offer these features.
In addition to these technical differences, there are also some key differences in the way that GA4 and UA report data. For example, GA4 reports users instead of sessions, and it uses a new conversion model that counts conversions across multiple devices and sessions.
Modeling Data to Overcome Challenges
To overcome the technical differences between UA and GA4, we merged UA session data with GA4 event data in Looker. This allowed us to combine data from both platforms into a single view, which gave us a more complete understanding of our customer behavior.
Some of the challenges that we faced were:
- More granular event data.
- Combining session data with event data.
- Nested parameters.
- Sessions spanning multiple days.
- Historical reporting.
- Creating keys for fields.
Below is more detail on each:
More granular event data
One challenge that we faced was the need to replace certain UA events with more granular GA4 events. For example, UA tracks time on page, while GA4 tracks engagement time, which includes both time on page and user activity such as scrolling.
Combining session data with event data
Another challenge that we faced was the need to combine UA session data with GA4 event data that was not directly comparable. Once we had combined the UA session data with the GA4 event data, we were able to update existing reports and dashboards in Looker to give us a complete view of our customer behavior across both platforms. This allowed us to identify trends, patterns, and opportunities that we would not have been able to see without combining the data.
GA4 event parameters are nested, while UA event parameters are stored in a single row. This required us to unnest the GA4 event parameter data in Looker before we could analyze it. To unnest GA4 event parameter data in Looker, we used the _UNNEST() function. This function takes an array as input and returns a table with one row for each element in the array.
For example, to unnest the event parameter, we would use similar SQL code:
This would return a table with one row for each event, and each row would contain the source and medium values for that event.
We can use the same approach to unnest any GA4 event parameter that is nested.
Once we have unnested the GA4 event parameter data, we can analyze it in Looker just like any other data. We can then create reports and dashboards to track trends, patterns and opportunities to start analyzing the data so that we can gain valuable insights into customer behavior.
Here are some tips for unnesting GA4 event parameter data in Looker:
- Use the _UNNEST() function to unnest any GA4 event parameter that is nested.
- Be sure to specify the data type of the event parameter when you unnest it. This will help to ensure that the data is stored correctly in Looker.
- If you are unnesting multiple event parameters, you can use the JOIN clause to combine the unnested data into a single table.
- Once you have unnested the GA4 event parameter data, you can analyze it in Looker just like any other data.
Sessions spanning multiple days
Sessions can exceed 24 hours in GA4, but not in UA. This is because GA4 uses an event-based data model, while UA uses a session-based data model. In UA, a session is defined as a group of user interactions that take place within a 30-minute window. If there is no activity for more than 30 minutes, then a new session is started.
In GA4, there is no set time limit for a session. A session can start and end at any time, and it can span multiple days. This is because GA4 tracks all user interactions as events, and it does not group them into sessions.
This difference in how sessions are tracked can have an impact on how certain metrics are calculated. For example, the source of a session is the first channel or campaign that a user interacted with before starting the session. In UA, the source of a session is captured when the first event in the session is fired. In GA4, the source of a session is captured when the last event in the session is fired.
This means that in GA4, we have to look back across multiple days to capture the source event for sessions that exceed 24 hours. Some approaches that can be taken to accomplish this are by using Looker calculated fields or by using a lookback window in dbt. An example of using the lookback window in dbt can be seen here:
To combine UA and GA4 data for historical reporting, we needed to union the two datasets and model the data to allow for a 6-month overlap without double-counting page views.
The requirement was to allow reporting to show average time on page (ATP) up to June 30, 2023, while also showing engagement time starting Jan. 1, 2023. Since ATP uses page views as part of the calculation, we needed to create two page view dimensions on the UA side of the union:
- One to include 1/1/23 – 6/30/23 for ATP calculations.
- One to exclude values for anything after 1/1/23 so we could combine it with GA4 pageview counts without double-counting.
Once we had created these page view dimensions, we could union the UA and GA4 datasets and use the appropriate page view dimension for each metric to calculate ATP, as well as engagement time.
The following sample SQL code shows how these calculated fields could be created:
Creating keys for fields
We created multiple keys in the data model by concatenating fields and then hashing them.
Here are the keys we created and their purpose:
- Client key: Concatenates the user pseudo ID and stream ID. Used to look back across session windows to determine source and medium.
- Session key: Concatenates the client key and GA session ID. Used to identify unique sessions by users, count unique pageviews and sessions.
- Session partition key: Concatenates the session key and date. Used to track sessions over multiple days and break them up into individual days.
- Event key: Concatenates the session key, event name, event date, and event parameters. Used to identify unique events, such as page views.
Creating these keys allowed us to:
- Combine UA and GA4 data into a single Looker data model.
- Maintain consistency in how we track users, sessions, and events across both platforms.
- Improve the performance of our Looker queries.
Here are some additional tips for creating keys in Looker data models:
- Use consistent naming conventions for your keys. This will make it easier to understand and use your data model.
- Choose the right data types for your keys. This will help to ensure that your keys are unique and efficient to store and query.
- Consider using hashing to create your keys. This can help to improve the performance of your Looker queries.
We improved Looker performance by moving slow-running and incomplete queries to dbt. This involved breaking up and moving complex joins across multiple large tables (e.g., WordPress, podcasts/YouTube, page report, events report) and calculating metrics in dbt first.
This approach is much more performant because it reduces the amount of data that needs to be transferred and processed by Looker. Additionally, the number of joins in dbt was reduced to just WordPress and one analytics table (podcasts, YouTube, GA page report, or GA events report).
Finally, the GA events and page reports were unioned in dbt and then joined back to WordPress in Looker.
Here is a more detailed explanation of each step:
- Move slow-running and incomplete queries to dbt. dbt is a data transformation tool that can be used to generate SQL code to perform complex data transformations. This can help to improve Looker performance by moving the most computationally expensive operations to dbt.
- Break up and move complex joins across multiple large tables to dbt. Joining large tables can be very slow and resource-intensive, especially when the tables are located in different databases. By breaking up the joins into smaller, more manageable chunks and performing them in dbt, you can significantly improve the performance of your Looker queries.
- Calculate metrics in dbt first. Calculating metrics in dbt can also help to improve Looker performance by reducing the amount of data that needs to be transferred and processed by Looker. For example, instead of calculating the average number of page views per user in Looker, you could calculate this metric in dbt and then store the results in a new table. Then, you could simply join this table to your WordPress table in Looker to get the average number of page views per user for each WordPress user.
- Reduce the number of joins in dbt. As mentioned above, joins can be very slow and resource-intensive. By reducing the number of joins in dbt, you can improve the performance of your Looker queries. In this case, the number of joins was reduced to just WordPress and one analytics table.
- Union GA events and page reports in dbt. Unioning two tables combines them into a single table that contains all of the rows from both tables. This can be useful when you need to combine data from different sources into a single table for analysis. In this case, the GA events and page reports were unioned in dbt to create a single table that contains all of the GA data for each WordPress user.
- Join each new table back to WordPress in Looker. Once the data has been transformed in dbt, it can be joined back to WordPress in Looker to create the desired reports and dashboards.
By following these tips, you can create a data model that will help you to overcome the technical differences between UA and GA4 so that you are able to get a more complete view of your customer behavior.