Cardinality is a harsh mistress
This was originally supposed to be a quick summary of why privacy mechanisms such as Divvi Up wouldn’t work for DNS TAPIR. Divvi Up provides differential privacy (DP), mainly for scalar values, by obscuring the true values across multiple independent, non-colluding servers.
Turns out, the question touches on several core features of DNS data, and the difficulty of collecting it while maintaining client privacy. It is also an extensive rabbit hole.
The cardinality problem, user version
This is also known as the “count distinct” problem. We keep using that word. It doesn’t quite mean what we think it means. What we want to illustrate is the difficulty of maintaining a (reasonably) correct count of the unique clients that have queried a specific domain within a given time frame. An example:
| Domain | Time | Clients | Comment |
|---|---|---|---|
| example.com | 12.15 | 5 | Individual number is correct |
| example.com | 12.20 | 7 | 3 users same as above (only 4 are new) |
| example.com | 12.25 | 22 | 5 users same as above and 2 additional from the top line (15 new) |
| example.com | 12.30 | 9 | 7 users same as above (2 new) |
| sum: | | 43 | 5 + 4 + 15 + 2 = 26 unique |

Obviously, just keeping a count of the number of users results in a significant over-count of users. At the other extreme, we keep a key for each user. Say:
| Domain | Time | Clients | Comment |
|---|---|---|---|
| example.com | 12.15 | A,B,C,D,E | 5 unique users |
| example.com | 12.20 | A,C,E,F,G,H,I | 4 new |
| example.com | 12.25 | B,D,C,E,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X | 15 new |
| example.com | 12.30 | D,G,J,M,P,R,S,Y,Z | 2 new |
| count: | | 26 | 5 + 4 + 15 + 2 = 26 unique |

This gives us a correct count, but for each additional time period the clients need to be distinctly counted, which is computationally costly. You also need to store all the keys, which uses lots of storage, especially for Internet Service Providers, who count their clients in the millions. Maintaining keys (even “made up” or pseudonymized keys) for clients is also troublesome for privacy reasons, and a rabbit hole in its own right (let’s not go there today).
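To make the arithmetic above concrete, here is a minimal Python sketch contrasting the two approaches, using the hypothetical single-letter client IDs from the example:

```python
# Per-time-window client sets from the example above (hypothetical IDs)
windows = {
    "12.15": set("ABCDE"),
    "12.20": set("ACEFGHI"),
    "12.25": set("BDCEGHIJKLMNOPQRSTUVWX"),
    "12.30": set("DGJMPRSYZ"),
}

# Naive approach: add up per-window counts -> counts repeat visitors again
naive_sum = sum(len(clients) for clients in windows.values())  # -> 43

# Exact approach: union all client sets, then count -> 26 unique clients,
# at the cost of keeping every key around
unique_clients = set().union(*windows.values())
exact_count = len(unique_clients)                              # -> 26
```

The exact approach is what the second table does by hand; the storage and privacy cost of keeping `unique_clients` is exactly the problem described above.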
To avoid storing all the keys and still get the user count in the right ballpark, there are mathematical tricks that achieve a decent approximation. The one used by DNS TAPIR is called HyperLogLog (HLL), a probabilistic method based on random distribution. An informative animation can be found in this blog post from Aggregated Knowledge, which is also the originator of the implementation used, chosen mainly because it has a defined binary format for the data sketches. This provides some modicum of interoperability and allows sketches to be transferred between systems. The way information is stored makes it difficult to discern which specific user is being counted, especially when that user is part of a larger group.
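For intuition, the core of HLL fits in a few lines of Python. This is a toy sketch of the general algorithm only — the register count, hash choice and bias correction here are simplified assumptions, not the Aggregated Knowledge implementation or its binary sketch format:

```python
import hashlib
import math

class ToyHLL:
    """Toy HyperLogLog: 2^p registers, each storing the max 'rank' seen."""

    def __init__(self, p: int = 14):
        self.p = p                 # precision: 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64 bits of a hash; first p bits pick a register, the rest give rank
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "ToyHLL") -> None:
        # Sketches union losslessly: take the per-register maximum. This is
        # what lets per-time-window sketches combine into one distinct count.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # standard bias correction
        est = alpha * self.m**2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:      # small-range: linear counting
            est = self.m * math.log(self.m / zeros)
        return round(est)
```

Note that the sketch stores only register maxima, never client keys, and that `merge` is what resolves the over-counting problem from the tables above: one sketch per time window, merged, yields one distinct count.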
So, having dealt with the cardinality of users, are we done?
Not quite!
The cardinality problem, the DNS version
Due to all things between human ingenuity and malicious intent, the number of domain names - the domain name space - is huge. According to the Domain Name Industry Brief (DNIB) there were 364.3 million registrations in the fourth quarter of 2024. Not all domain names are renewed, however, so the total number of active domains is roughly 764 million (according to Domain Name Stat).
This is just at the level of the first label, i.e. example.com, so the number of distinct Fully Qualified Domain Names (FQDNs), for example host.section.example.com or xZb3hak39a2df.random.example.com, is orders of magnitude larger. Also, in the case of the random-string subdomain, the FQDN is only likely to be queried once. Ever. This is commonly referred to as “the long tail”. An obvious consequence is that the unique query becomes an indirect identifier for the querying user, which could be seen as a feature.
Life on a spectrum
Unique queries live at one end of the DNS data spectrum, appearing only once, with user information largely irrelevant and a small time window of interest. The main point of interest for these is the name itself and the type of query. In DNS TAPIR, this directly corresponds to New Domain events generated by the Edge system whenever it sees an unknown FQDN for the first time.
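Conceptually, generating such first-seen events is just a membership check. The following is a hypothetical sketch of the idea, not the actual Edge implementation, which would need persistence, expiry and memory bounds:

```python
# Hypothetical first-seen filter illustrating New Domain event generation.
seen: set[str] = set()

def observe(fqdn: str) -> bool:
    """Return True (and remember the name) if this FQDN is new to us."""
    name = fqdn.rstrip(".").lower()  # normalize trailing dot and case
    if name in seen:
        return False
    seen.add(name)
    return True  # caller would emit a New Domain event here
```

Even this toy version shows the tension discussed above: a plain `set` of names is exact but grows without bound on the long tail, which is why probabilistic structures become attractive at scale.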
Queries to well-known domains sit at the other end, asked for by many users, with query and user patterns that fluctuate over time. The qualifier well-known is mainly related to visiting frequency, obviously including domain names such as www.google.com, facebook.com and those of other large corporations, together with known infrastructure domains such as the root name servers, Amazon and CDNs, but it is in no way synonymous with safe, uncontroversial or even non-malicious domains.
For DNS TAPIR, well-known includes domains – and aggregates of domains using wildcards – that are sufficiently known by the project, but also uninteresting domains, domains to be researched later, or just plain noisy domains; this set is the basis for what is included in the aggregated reports generated by the Edge system.
Between the two ends of this spectrum lies an ocean of queries, related to smaller, more local resources, and therefore also more closely connected to individual clients asking for them – for example, the pizzeria at the corner, your local numismatic society, or the kid’s school. If data polluted with personal information is toxic as Maciej Ceglowski suggests, this mid-section is probably radioactive, since it is by far the most privacy-sensitive. Our current strategy for handling this data is to throw it away, but the plan is to process data at the Edge, creating reports and training data sufficiently aggregated to disconnect them from the originating clients. The challenge is non-trivial, or so we believe.
Divvi Up
Back to Divvi Up. When letting a client send some sensitive value using Divvi Up, the label for the transferred sensitive data must not in itself be a source of information. In our case, most values are connected to a domain name, and if that domain name is google.com it might not be problematic to send a Divvi Up’ed counter for Google. If the domain is seriouslycontroversialparty.org or someweirdcult.org, the mere existence of the label is problematic. Sending all possible labels from every client is not feasible as the DNS namespace is practically infinite, and adding sufficient noise to effectively obscure the sensitive data is hard and resource intensive.
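To see why the label, not the value, is the leak: the value itself really is well protected. A minimal sketch of the underlying idea — additive secret sharing of a counter between two non-colluding servers — looks like this. Note that this is a simplification for illustration; the real Divvi Up protocol (DAP, with Prio-style verifiable shares) is considerably more involved:

```python
import secrets

MODULUS = 2**64  # all share arithmetic is done modulo a fixed value

def share(value: int) -> tuple[int, int]:
    """Split a counter into two shares; each share alone is uniformly
    random and reveals nothing about the value."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS

def combine(share_a: int, share_b: int) -> int:
    """Only the sum of both shares recovers the value (or an aggregate)."""
    return (share_a + share_b) % MODULUS
```

Each server can sum the shares it receives from many clients independently, and combining the two aggregate shares yields the total without ever exposing an individual contribution. But the label naming what is being counted still has to travel with the shares, which is exactly the problem described above.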
This relates to another topic for DNS data analysis: the existence of data is signal, the absence of (expected) data is also signal, and the when, how much and how often are signal too. It is possible to discern a number of interesting features of a client through signal analysis, and one obvious data point that can’t be hidden from either server is that the client is using some software that deploys Divvi Up. More on that at some other time.