3.1.1
Methodology
Discovering trackers
To detect services that offer CNAME-based tracking, we used a three-pronged approach that leverages features intrinsic to the ecosystem, combining both automated and manual analysis.
First, we filtered all requests in HTTP Archive’s dataset and retained only those that were same-site but not same-origin, i.e. requests to the same eTLD+1 as the visited web page but to a different origin. Of these, we only retained requests to domain names whose CNAME record referred (either directly, or indirectly through a chain of further CNAME records) to a different eTLD+1 domain in our DNS data. We aggregated these requests on the eTLD+1 of the CNAME record and recorded a variety of information, such as the average number of requests per website, the variation in request size, and the percentage of requests that contain a cookie or set one via the HTTP response header. In Appendix B we elaborate on these features and discuss how they could be used to assist or automate the detection of CNAME-based tracking. Out of the resulting 46,767 domains, we only considered those that are part of a CNAME chain on at least 100 different websites, which left us with 120 potential CNAME-based trackers.
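The filtering step described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `PUBLIC_SUFFIXES` set is a tiny stand-in for the real Public Suffix List, the `cname_records` mapping stands in for the DNS data, and all function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def etld_plus_one(host):
    """Return the registrable domain (eTLD+1) of host, or None if unknown."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def origin(url):
    """Return the (scheme, host, port) triple that defines a web origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def cname_chain(host, cname_records, max_depth=10):
    """Follow CNAME records from host, guarding against loops."""
    chain, seen, current = [], {host}, host
    for _ in range(max_depth):
        target = cname_records.get(current)
        if target is None or target in seen:
            break
        chain.append(target)
        seen.add(target)
        current = target
    return chain


def candidate_trackers(requests, cname_records, min_sites=100):
    """Count, per CNAME eTLD+1, the distinct websites whose same-site,
    cross-origin requests resolve to it; keep those on >= min_sites sites.

    requests: iterable of (page_url, request_url) pairs.
    """
    sites_per_target = defaultdict(set)
    for page_url, request_url in requests:
        page_host = urlsplit(page_url).hostname
        req_host = urlsplit(request_url).hostname
        if not page_host or not req_host:
            continue
        site = etld_plus_one(page_host)
        if site is None or etld_plus_one(req_host) != site:
            continue  # not same-site
        if origin(page_url) == origin(request_url):
            continue  # same-origin requests are out of scope
        for target in cname_chain(req_host, cname_records):
            target_site = etld_plus_one(target)
            if target_site and target_site != site:
                sites_per_target[target_site].add(site)
    return {t: len(s) for t, s in sites_per_target.items() if len(s) >= min_sites}
```

With `min_sites=100` the function mirrors the threshold stated in the text. A production version would presumably rely on the full Public Suffix List and on recorded DNS responses rather than an in-memory dictionary.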
In the second phase, we performed a manual analysis to rule out services that have no strict intention to track users. Many services that are unrelated to tracking, such as CDNs, use a same-site subdomain to serve content, and may also set a cookie on this domain, thus giving them potential tracking capabilities. For instance, Cloudflare sets a __cfduid cookie in order to detect malicious visitors, but does not intend to track users with this cookie (user information is kept less than 24 hours) [12]. For each of the 120 domains, we visited the web page of the related organization (if available) and gathered information about the kind of service(s) it provides, according to the information and documentation on its website. Based on this information, we then determined whether tracking was the main service provided by the company: because the company explicitly indicated this, because tracking would be required for the main advertised product (e.g. in order to provide users with personalized content), or because this was clear from the way the products were marketed.
For instance, one such provider, Pardot, offers a service named “Marketing Automation”, which they define as “a technology that helps businesses grow by automating marketing processes, tracking customer engagement, and delivering personalized experiences to each customer across marketing, sales, and service”, indicating that customers (website visitors) may be tracked.
Finally, we validated this classification based on the requests sent to the purported tracker when visiting a publisher website: we only considered a company to be a tracker when a uniquely identifying parameter was stored in the browser and sent along with subsequent requests, e.g. via a cookie or using localStorage. Using this method, we found a total of five trackers. Furthermore, we extended the list with eight trackers from the CNAME cloaking blocklist by NextDNS [13, 37].
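One way to approximate this validation criterion automatically is sketched below. This heuristic is an assumption for illustration, not the procedure followed in this study: it treats a parameter as a candidate identifier when its value stays constant across all requests within one browser profile, differs between independent profiles, and is reasonably long. The function name, the input shape and the `min_length` threshold are all hypothetical.

```python
def identifier_candidates(profiles, min_length=8):
    """Return names of parameters that look like unique user identifiers.

    profiles: list of browser profiles; each profile is a list of dicts
    mapping parameter name -> value as sent in one request to the
    suspected tracker (e.g. cookie or localStorage-derived values).
    """
    if len(profiles) < 2:
        return set()  # need independent profiles to test uniqueness

    stable_per_profile = []
    for requests in profiles:
        if len(requests) < 2:
            return set()  # need subsequent requests to test persistence
        # Parameters present in every request of this profile.
        names = set(requests[0])
        for request in requests[1:]:
            names &= set(request)
        # Keep those whose value never changes within the profile.
        stable = {
            name: requests[0][name]
            for name in names
            if len({request[name] for request in requests}) == 1
        }
        stable_per_profile.append(stable)

    shared = set(stable_per_profile[0])
    for stable in stable_per_profile[1:]:
        shared &= set(stable)

    candidates = set()
    for name in shared:
        values = [stable[name] for stable in stable_per_profile]
        all_distinct = len(set(values)) == len(values)
        long_enough = all(len(value) >= min_length for value in values)
        if all_distinct and long_enough:
            candidates.add(name)
    return candidates
```

A parameter such as a language preference is constant within a profile but identical across profiles, so it is excluded, while a timestamp changes on every request and is excluded as well. Any candidate returned by such a heuristic would still need manual confirmation.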