Single vs. Multiple Datasets in AEP

So you’re modernizing your implementation by moving to AEP and you have a bunch of websites. One of the first concepts you’re introduced to in AEP is that of Datasets. It’s the place where your data goes when you send data into AEP. It’s pretty important – and it’s probably one of the first decisions you’ll make.

In Adobe Analytics, you might be creating 1 report suite and a bunch of virtual report suites. Alternatively, maybe you have them all broken out with a roll-up of some sort. Well, we’re not in Kansas anymore, Toto. Metaphors and design patterns are different in the AEP/CJA world. So… do I use a global dataset or do I create 1 dataset for each property? The answer is, predictably: it depends.

This article isn’t for companies deciding whether to consolidate 300 websites into 1 dataset. Yes, consolidate that $#!&. This article isn’t for people with 1 website. You can’t POSSIBLY be considering joining third-party datasets AND website data into a single dataset… though part of me is morbidly curious as to what would happen if you created 1 event dataset and 1 profile dataset to rule them all. I’ll save that one for later. This article is for people with 2-50 digital properties (including apps and stuff).

Outcomes We Want

Fast, Cheap, and High Quality

It’s always best to start from the outcomes that we want from our strategy. So let’s break it down into a few high-level goals. We want it to be…

… low maintenance.

… low cost.

… the highest data quality.

This seems reasonable. Maintenance includes stuff like making it intuitive, factoring in DULE classifications, stuff like that. With cost, there are some factors we’ll want to consider, but silly mistakes are easily avoidable. Finally, data quality is paramount. We can create super cute solutions, but if it’s at the expense of quality data…

By the way, I’m writing this under the assumption that you’re using 1 global schema across your various properties. You might have a use case where you need different schemas for different websites. Clearly, you’ll need different datasets for different schemas. This article isn’t really for you… and I hope you’ve put thought into your decision to use such a bespoke approach.

Maintenance

We have not yet surrendered to Tedius, the Greek god of getting stuck doing operational crap… right? I know a lot of new tools have come out, but I think we still care about this. If I create 200 datasets, am I going to have to hire a team to maintain them? The short answer is no. The long answer is that if you have 200 datasets, you probably already have a team.

One dataset puts pressure on you to really lean into your Sandbox environment to make sure changes don’t screw anything up. On the other hand, multiple datasets can introduce inconsistencies if you forget to consistently toggle settings like Error Thresholding.

Another consideration is that there’s a maximum number of datasets per connection (100). If you plan to create a connection with 100+ datasets so you can have that neato merged Data View – you may want to reconsider. From a maintenance perspective, every time a new dataset is added it must also be added to the CJA connection. 
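If you want a quick sanity check before onboarding a new property, here's a minimal sketch of the bookkeeping. It only uses the 100-dataset limit mentioned above; the function name and the dataset names are made up for illustration, and nothing here calls an Adobe API.

```python
MAX_DATASETS_PER_CONNECTION = 100  # the per-connection limit quoted above


def plan_connection(current, incoming, limit=MAX_DATASETS_PER_CONNECTION):
    """Merge dataset names (dropping duplicates) and report whether they fit."""
    merged = list(dict.fromkeys(current + incoming))  # keeps first-seen order
    return {
        "datasets": merged,
        "count": len(merged),
        "headroom": limit - len(merged),  # negative means over the limit
        "fits": len(merged) <= limit,
    }


# Hypothetical naming: one event dataset per property.
current = [f"web_events_site_{i:02d}" for i in range(1, 99)]  # 98 datasets
incoming = ["web_events_site_99", "app_events_ios", "app_events_android"]

plan = plan_connection(current, incoming)
print(plan["count"], plan["headroom"], plan["fits"])
```

The point is less the code and more the habit: with one dataset per property, "add a property" always means "touch the connection too", so it's worth knowing how much headroom you have left.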

More datasets = more maintenance. However, long-term it’s pretty negligible. A lot of the extra work you do up-front might actually save some down the line. We’ll talk about that later.

Cost

You think this stuff is free? We want to make sure the licensing doesn’t end up on our performance review.

Conceptually, it all costs the same. Since we’re working with humans, there might be some variability here. If I’m a large company with a lot of datasets, I might be more inclined to create multiple connections to accommodate specific needs. For instance, I might create a connection for every website, then another for some subset of websites. Every connection basically creates a lil’ storage center for your data, enabling quick retrieval. That’s why there can be a cost associated with volume of data that’s being passed to CJA via a connection. Multiple datasets could embolden users to create more connections with redundancies (and, therefore, impact cost).
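To make the redundancy risk concrete, here's a rough sketch. It assumes, as described above, that volume counts once for every connection a dataset is added to. The row counts and connection names are invented, and this is not how Adobe actually bills anything; check your own contract.

```python
def rows_sent_to_cja(rows_per_dataset, connections):
    """Return (rows counted across all connections, rows counted once per dataset)."""
    total = sum(
        rows_per_dataset[name]
        for members in connections.values()
        for name in members
    )
    used = {name for members in connections.values() for name in members}
    unique = sum(rows_per_dataset[name] for name in used)
    return total, unique


# Hypothetical row counts per dataset.
rows_per_dataset = {"site_a": 1_000_000, "site_b": 250_000, "site_c": 50_000}

# Hypothetical connections: one global, plus two "convenience" subsets.
connections = {
    "all_sites": ["site_a", "site_b", "site_c"],
    "site_a_only": ["site_a"],
    "a_and_b": ["site_a", "site_b"],
}

total, unique = rows_sent_to_cja(rows_per_dataset, connections)
print(f"{total:,} rows across connections vs. {unique:,} unique rows")
```

If the first number is a multiple of the second, those extra connections are where the multiplier comes from.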

One could argue that a single, large dataset is also subject to a much more consequential error if someone were to add that dataset to yet another connection. This is true, but would likely happen at a lower frequency (read: I am guessing). There’s also that Stitching limitation you have based on your CJA package. If your contract states you can only stitch 5 datasets and you want to stitch 10? Well…

Then there’s the Profile Data Store. If you have multiple properties and you do NOT want all of them to be enabled for Profile because you’ll exceed your limit, you will need to separate them (at least the non-Profile ones).

Neither strategy is a clear winner here. Multiple datasets are probably a little safer.

Data Quality

Your data will be the same no matter where you plop it. You should be using that global schema. What could go wrong? The quality of your data will inevitably be the responsibility of its source. Maybe that’s Adobe Launch. Maybe it’s coming from somewhere else. From a configuration perspective, a single dataset ensures consistency (see the Maintenance section).

However, here’s where multiple datasets really shine. It’s SO much easier to monitor the dataset performance when they’re split out. It’s SO much more convenient to look at the failed batches. Data look funny? Click Dataset and look for red dots. It’s so brain-dead simple, I could literally assign that task to the CEO. You might think that finding the red dot is easier with a larger dataset. It’s not! I mean, if I have a routine of checking on my datasets each morning while I make my Frappuccino and avocado toast – I can immediately determine where a problem exists when the datasets are split out. I can’t do that with 1 dataset.

This can STILL be achieved with a single, larger dataset; but it takes quite a few more steps to really get down to a single culprit (is it a problem with 1 website or 20?).
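Here's a small sketch of why the split-out version is easier to eyeball. The batch records are hypothetical, and the `site` field in particular is an assumption: with a single dataset, working out which site a failed batch belongs to is exactly the extra step you have to do yourself.

```python
from collections import Counter


def failed_by(batches, key):
    """Count failed batches, grouped by the given field."""
    return Counter(b[key] for b in batches if b["status"] == "failed")


# Hypothetical batch log with one dataset per property.
split = [
    {"dataset": "events_site_a", "site": "site_a", "status": "success"},
    {"dataset": "events_site_b", "site": "site_b", "status": "failed"},
    {"dataset": "events_site_b", "site": "site_b", "status": "failed"},
    {"dataset": "events_site_c", "site": "site_c", "status": "success"},
]

# The same batches, all landing in one global dataset.
single = [dict(b, dataset="events_global") for b in split]

print(failed_by(split, "dataset"))   # the culprit is visible by dataset name
print(failed_by(single, "dataset"))  # only tells you the global dataset has failures
print(failed_by(single, "site"))     # needs a site field you derived yourself
```

With split datasets, the dataset name is the answer. With one dataset, the dataset name tells you nothing and you have to go digging for the site.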

Final Thoughts

Single, global datasets are cleaner. They force data into a standard pattern (think partial ingestion settings, Profile toggle, labels and contracts). They probably reduce risk of redundancies in connections. Using Query Service might be a little easier. They’re lower maintenance, as you don’t have to create a new dataset with every new property. If it weren’t for the difference in monitoring, I would recommend everyone use a single dataset. But monitoring is too damn important.

In theory, you shouldn’t be spending much time in the dataset interface. After setup, you’re not really in there doing stuff (I hope). You can lock down connections to control cost risks. Realistically, you probably don’t have 50 datasets that need stitching. When you ARE thinking about datasets… when you ARE in the interface – it’s because you’re monitoring something.

Is data for [some website] being ingested properly?

What’s causing all of these batch failures?

You get the idea. There are ways to debug this stuff with 1 dataset, but nothing turnkey out-of-the-box. For that reason, if you have fewer than 20 or so properties – using separate datasets is probably your best option.

Other Thoughts

It’s hard to collect every single data point to make a recommendation. I want to include other folks’ input on this topic because the end-goal is that you have complete information. Here are some other inputs for your consideration!

Garrett Chung posted on LinkedIn:

Another consideration for splitting datasets is when we have multiple source systems making a subset of updates to profile attributes or making use of Dataset precedence in merge policies.

This is a great point in favor of multiple datasets! The first part of his comment is about multiple source systems, each contributing attributes that may exist in one place but not another. When you have one big dataset, it can get a little messy when adding a parameter that exists on one property but not another. The other part of the comment is about Identity merge policies. When you are setting up merge policies, you may want to give higher precedence to datasets from properties that have a login (marketing sites may not have one). Multiple datasets allow you to ensure that the identity set on the authenticated website will be the primary identity. This is very important.
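As a toy illustration of the precedence idea, here's a sketch where each dataset contributes a fragment of a profile and the higher-precedence dataset wins any conflict. This is a simplified model, not Adobe's actual merge logic, and the dataset names and attributes are invented.

```python
def merge_by_precedence(fragments, precedence):
    """Merge per-dataset attribute fragments.

    `precedence` lists dataset names, highest priority first.
    """
    merged = {}
    # Apply lowest priority first so higher-priority values overwrite them.
    for dataset in reversed(precedence):
        merged.update(fragments.get(dataset, {}))
    return merged


# Hypothetical fragments for one person, from two separate datasets.
fragments = {
    "marketing_site": {"email": "guess@example.com", "favorite_category": "shoes"},
    "account_portal": {"email": "verified@example.com", "loyalty_tier": "gold"},
}

# The authenticated property outranks the marketing site.
profile = merge_by_precedence(fragments, ["account_portal", "marketing_site"])
print(profile)
```

Notice that this only works because the two sources are separate datasets. If everything lands in one big dataset, there's nothing to rank.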
