In this article, you'll learn the ins and outs of deciding between batch vs. event-driven operational analytics. I'll break down:
- Scheduled data upkeep vs. action-driven events
- Considerations when using batch vs. event orchestration systems
- How to prioritize stakeholder needs in your architecture
When you send your data directly to third-party tools, you allow business users (read: your marketers, salespeople, etc.) to use information directly where they spend time day-to-day. This is a really powerful data flow and one that democratizes the use of consistent, high-quality data.
If you’re already on board with this idea, you’ve probably identified how you’ll collect click and pageview event data. You’ve also scoped out the reverse ETL tool to load batch data into a destination. Now you’re left with an important architectural decision that will shape the way business users engage with your data: How and when to send your data.
The million-dollar question: What does this decision between batch and event systems entail, and what are its impacts? Let’s discuss.
Scheduled data upkeep vs. action-driven events
Workflow orchestrators like Airflow, Dagster, and Prefect have become integral to modern data stacks. These tools run processes such as dbt queries or reverse ETL syncs on a set schedule, ranging anywhere from once per day to every 15 minutes.
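To make the batch side tangible, here’s a minimal sketch of what such a scheduled sync might look like in Airflow (assuming Airflow 2.4+; the `trigger_crm_sync` callable is a hypothetical placeholder for whatever kicks off your reverse ETL or dbt run):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_crm_sync():
    """Hypothetical placeholder: call your reverse ETL tool's API (or run a
    dbt job) to push the latest modeled data to the destination."""
    pass


# A minimal hourly DAG; swap the schedule for "@daily", "*/15 * * * *", etc.
with DAG(
    dag_id="hourly_crm_sync",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="trigger_crm_sync",
        python_callable=trigger_crm_sync,
    )
```

The important part isn’t the operator, it’s the cadence: whatever runs inside that task, the destination only gets fresher data each time the schedule fires.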
Here’s the main difference between batch data upkeep and event-driven systems:
- In batch data upkeep, syncs are run on a schedule to keep a resulting dataset up to date.
- Event-driven systems maintain a stream of events that occur over time.
The fundamental difference between batch and event-driven systems comes down to the format of the data in the target tool: a dynamic dataset that’s updated in place as records change (batch) versus an ever-growing, append-only historical timeline of events (event-driven).
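To make that difference concrete, here’s a minimal sketch of the two shapes (field names and values are purely illustrative, not from any particular tool):

```python
# Batch: one row per lead, kept up to date in place on every scheduled sync.
lead_record = {
    "email": "jane@example.com",
    "phone": "555-0100",           # overwritten when the lead changes it
    "last_seen_at": "2024-05-01",  # recomputed from the warehouse each sync
}

# Event-driven: an append-only stream; rows are only ever added, never updated.
event_stream = [
    {"event": "Page Viewed", "properties": {"path": "/pricing"}, "timestamp": "2024-05-01T09:14:00Z"},
    {"event": "Link Clicked", "properties": {"path": "/demo"}, "timestamp": "2024-05-01T09:16:30Z"},
]
```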
Batch systems tend to be built on top of common data warehouse tooling. Event orchestration occurs client-side on a website, so it must be fairly ingrained in the engineering tech stack. Common tools for event orchestration include Segment, Amplitude, and Snowplow, among many others.
For illustration purposes, consider a typical CRM tool like Salesforce or HubSpot. A CRM contains lead contact information like name, email, and phone number. Additional data could include the last date the lead interacted with your site and the ad platform (Facebook, Google, etc.) the lead was sourced from. In a batch system, if a lead changed their phone number, the next scheduled sync would update this crucial piece of information.
By contrast, event-driven systems send each specific event occurrence to the target tool. Continuing with the same example, a CRM likely also contains all pages viewed by the lead. Is the person interested in your paid or free offering? Is the person engaging with your product or blog pages? Has the lead contacted you in the past, and what did they say? If a person clicks on a link, the click event would (fairly quickly) show up in Salesforce.
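As a rough illustration of what “sending the event occurrence” looks like, here’s a hedged sketch using Segment’s Python server-side library (the browser snippet is JavaScript, but the call shape is the same); the write key, user ID, event name, and properties are all illustrative:

```python
import analytics  # Segment's analytics-python package; newer releases use `import segment.analytics as analytics`

analytics.write_key = "YOUR_WRITE_KEY"  # illustrative placeholder

# Each call appends one immutable event to the stream. A destination such as
# Salesforce receives the occurrence itself rather than a recomputed field.
analytics.track(
    user_id="lead_123",
    event="Link Clicked",
    properties={"path": "/demo", "referrer": "/blog"},
)
```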
This CRM example is just one of many use cases where the question of batch vs. event-driven systems comes into play. Let’s dive into which business use cases are most applicable to each system design, as well as the resources needed to implement each.
Considerations when using batch vs. event orchestration systems
Generally speaking, there are three primary considerations that determine whether you should go the batch or event-driven route:
- The team within your organization
- A desire to update historical data (or not)
- Business use case
Let me break down the considerations and which system is best for each.
The team within your organization
Event orchestration is usually integrated within a web application, while batch uploads are easily configured with a data warehouse or data lake. Each system requires its own expertise. Ask yourself: Is your company staffed mostly with data engineers or with front-end developers? If the former, batch uploads may be the easiest route; if the latter, you may want to go with event orchestration.
However, event orchestration gets tricky once you try to account for all the caveats. For instance, anyone implementing or using the data should be aware of ad blockers, which can prevent client-side events from ever being captured.
With batch jobs, more of the work falls on the underlying data than on the actual reverse ETL implementation. This means your analytics team is tasked with modeling the data and understanding what needs to be sent.
A desire to update historical data (or not)
If an organization’s concept of a customer changes, it might need a bulk update to make sure the leads in the CRM are the contacts the sales team wants to focus on. In a batch system, any update to the underlying data would bulk-update the destination tool, so this type of change is a walk in the park.
However, by definition, event data is append-only. In non-tech speak, this means historical events never go away. If a lead viewed the /blog page before it was renamed to /community, the sales team would have to account for both cases when targeting a group of leads.
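As a sketch of what “accounting for both cases” means downstream (the paths and the helper function are illustrative, not from any particular tool):

```python
# Because historical events are never rewritten, any logic built on them has to
# keep recognizing the old page path alongside the new one.
COMMUNITY_PATHS = {"/blog", "/community"}  # old and new names for the same page

def viewed_community_page(events):
    """Return True if the lead ever viewed the community page under either name."""
    return any(
        e["event"] == "Page Viewed" and e["properties"]["path"] in COMMUNITY_PATHS
        for e in events
    )
```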
Consider how often existing data changes. If it changes frequently, batch updates will be easier to sustain automatically.
It all comes down to the use case
Event data makes triggering emails on a particular action extremely straightforward. For example, consider a sales strategy that automatically triggers an email three hours after someone views a demo page, with the hope of engaging the lead while their interest is high. Both approaches make this easy to implement: the event data contains a /demo page-view event, and the batch data contains a last_demo_page_viewed_time field.
However, what if the sales strategy changes to three hours after someone views the community page instead of the demo page? With event data, sales can update this logic directly in Salesforce, since all the underlying events are already available. In the batch scenario, the analytics team would need to add a field for the last time the community page was viewed to make this possible.
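To illustrate why the event-driven path needs no schema change, here’s a hypothetical trigger check: switching from the demo page to the community page is a one-line edit, because the raw events already carry the path (the function, delay, and timestamp format are assumptions made for the sketch):

```python
from datetime import datetime, timedelta, timezone

TARGET_PATH = "/community"           # was "/demo" before the strategy changed
FOLLOW_UP_DELAY = timedelta(hours=3)

def should_send_follow_up(events, now=None):
    """True if the lead's most recent view of the target page was at least
    three hours ago. Assumes each event's "timestamp" is a timezone-aware datetime."""
    now = now or datetime.now(timezone.utc)
    views = [
        e["timestamp"]
        for e in events
        if e["event"] == "Page Viewed" and e["properties"]["path"] == TARGET_PATH
    ]
    return bool(views) and now - max(views) >= FOLLOW_UP_DELAY
```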
When actions are triggered by particular events, sending those events directly to the target tool will almost always be easier for the stakeholder.
At the end of the day, the decision comes down to who uses your data (and how)
With continuous syncing, the decision is less a question of how quickly the data can be updated and more a question of how fresh end users realistically need the data to be. Both reverse ETL tools like Census and event orchestration tools like Segment can operate with near real-time updates or on a delay.
Let’s recall the reason we’re syncing to destination tools: to make stakeholders’ jobs easier.
Keeping that in mind, talk to your stakeholders about the workflows they want to build. Ask your sales team whether they want to send emails on a schedule to a population of users meeting criteria at one point in time (a natural fit for a batch system), or on a rolling basis triggered by a particular action (better suited to events).
The last thing you want to do is format the data in a way that’s fundamentally unusable for the stakeholder; in that case, they’ll just continue to make manual updates, which isn’t convenient for anyone.
Piqued your interest? Check out the Census Airflow Provider and sign up for a free trial to start experimenting.
Have any other questions about event orchestration systems? I’m happy to chat on Twitter or LinkedIn.