TOPIC:

Computer Security: Tigers and cats against crawling mAIce

Written by:

Computer Security Office

IT has for some time been transitioning into a new era in which artificial intelligence (AI) takes over more and more tasks and, eventually, responsibility. Few of us are left who don’t use one of the plethora of tools such as ChatGPT, Mistral, Claude and the like. Performance gains and increased efficiency are promised everywhere and might lead to a revolution in the job market, in programming, in customer service, in research. And, more sadly, in warfare. However, in all cases, the “intelligence” in AI must come from somewhere, and this is where the problems start… with greed and non-compensation.

Similar to the era of search engines, when Lycos, Yahoo!, AltaVista and Google crawled the web in an unconstrained manner to build up their indexes, the ChatGPTs of today are again crawling the web to train their large language models (LLMs). We are back to square one. History is repeating itself. But while search engines later learned to respect the so-called “robots.txt” file, which defines what content may be crawled and what content should be spared, and while the discussion on who owns the crawled content has, to some extent, been held, LLM crawlers enter this territory like the settlers in the old Wild West. Copyrights don’t count. File ownership doesn’t count. Robots.txt files don’t count. There are no barriers, speed limits or traffic restrictions. The more the better. The faster the better. Greed has entered the race; copyrights, ownership and load problems are ignored. The strongest prevails, all others can sue later.
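To make the “robots.txt” idea concrete, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler names (`ExampleLLMBot`, `SomeSearchBot`) and the `example.org` URLs are purely illustrative assumptions, not taken from any real CERN service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; crawler name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("ExampleLLMBot", "https://example.org/docs/page.html"),
        ("SomeSearchBot", "https://example.org/docs/page.html"),
        ("SomeSearchBot", "https://example.org/private/notes.html"),
    ]:
        print(agent, url, may_fetch(rules, agent, url))
```

The check is entirely voluntary: it only protects a site if the crawler chooses to run it, which is exactly the problem described above.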

While the goal here is not to delve into the copyright issue, which has been discussed in depth (but not settled) elsewhere, let’s discuss the more fundamental problem of AI training: unconstrained crawling. Like many other web service providers, CERN struggles with LLM training crawlers trying to pull in massive amounts of data from CERN web services, making simultaneous requests from many uncorrelated sources and, hence, putting a heavy load on those web services. In fact, these training requests closely resemble malicious, bot-driven distributed denial-of-service (DDoS) attacks. And as with any such widely distributed unsolicited requests, it is hard to distinguish legitimate (user) connections from LLM training crawls or DDoS attacks.

Standard incident response kicks in in either case, and the CERN Computer Security Office engages in a cat-and-mouse game to block any unsolicited connections while (trying to) keep the core web applications in service for legitimate users. Sometimes with more success, sometimes with less (as can be read in our monthly reports from September and November last year). Unfortunately, crawler traffic and DDoS attacks come from a multitude of different IP addresses (hence the “D” for “distributed”), meaning that blocking illegitimate traffic at the TCP/IP level by individual IP addresses or IP address ranges becomes cumbersome, resource-consuming and, in the end, practically impossible. On the other hand, blocking at the level of autonomous system numbers (ASNs), or even blocking traffic from whole countries, creates significant collateral damage.
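As a rough illustration of what blocking by IP address or address range means in practice, the sketch below checks a client address against a blocklist using Python's standard `ipaddress` module. The list `BLOCKED_NETWORKS` and the function `is_blocked` are hypothetical names, and the ranges are reserved documentation addresses, not real crawler sources or actual CERN rules.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# not addresses of any real crawler.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.128/25", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.5", "198.51.100.200", "203.0.113.9"):
        print(ip, is_blocked(ip))
```

Every new source address needs a new entry, and every range that is too broad locks out legitimate users. With requests arriving from many uncorrelated sources, such a list quickly becomes unmanageable.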

Tigers to the rescue! At the web application level, the incoming traffic is visible in much more detail than at the basic network level (in particular at CERN, where all encrypted communication is left intact in order to preserve privacy). Hence, web service managers have a few more general and mandatory technical measures at hand to best protect their IT services against a multitude of DDoS attacks and AI crawlers. And, joining forces, those tigers and cats might also come up with more centralised means (like web proxies, web application firewalls, load balancers, CDNs) to further protect CERN against the crawling mice of AI. Until they respect “robots.txt” files (or similar). Stay tuned.
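The article does not spell out which application-level measures are meant. As one possible example, and purely as an assumption on our side, a web application could limit how many requests a single client may make within a time window. In the minimal sketch below, the class `SlidingWindowLimiter`, its parameters and the client key are all hypothetical; a real service might key on a session, an authenticated user or a combination of request attributes.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-client queue of timestamps of recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, client_key: str, now: float) -> bool:
        """Return True and record the request if the client is under its limit."""
        hits = self._hits[client_key]
        # Forget requests that have dropped out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    # Timestamps are passed in explicitly to keep the example deterministic;
    # a real service would use something like time.monotonic().
    for t in (0.0, 1.0, 2.0, 3.0, 10.0):
        print(t, limiter.allow("client-a", now=t))
```

Such a limiter does not help against traffic spread thinly over a great many sources, which is why it would be only one building block alongside the centralised means mentioned above.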

_________

Do you want to learn more about computer security incidents and issues at CERN? Follow our Monthly Report. For further information, questions or help, check our website or contact us at Computer.Security@cern.ch.
