| author | Laurent Bercot <ska-skaware@skarnet.org> | 2025-12-23 23:40:36 +0000 |
|---|---|---|
| committer | Laurent Bercot <ska-skaware@skarnet.org> | 2025-12-23 23:40:36 +0000 |
| commit | 58a3b631542da268195c9ad8cf019e45e8584bcd | |
| tree | b013e75a4b36edd40e2d0f642985cd9833106c53 | |
| parent | a15bb963a5623451f2941e441351915423b988fd | |
| download | tipidee-58a3b631542da268195c9ad8cf019e45e8584bcd.tar.gz | |
Document tipidee-logaggregate and cgiwrapper-nollmcrawler
| -rw-r--r-- | doc/cgiwrapper-nollmcrawler.html | 165 | ||||
| -rw-r--r-- | doc/index.html | 8 | ||||
| -rw-r--r-- | doc/tipidee-logaggregate.html | 108 | ||||
| -rw-r--r-- | src/misc/cgiwrapper-nollmcrawler.c | 2 | ||||
| -rw-r--r-- | src/misc/tipidee-logaggregate.c | 3 |
5 files changed, 282 insertions, 4 deletions
diff --git a/doc/cgiwrapper-nollmcrawler.html b/doc/cgiwrapper-nollmcrawler.html
new file mode 100644
index 0000000..3fa5d2b
--- /dev/null
+++ b/doc/cgiwrapper-nollmcrawler.html
@@ -0,0 +1,165 @@
+<html>
+ <head>
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+ <meta http-equiv="Content-Language" content="en" />
+ <title>tipidee: the cgiwrapper-nollmcrawler program</title>
+ <meta name="Description" content="tipidee: the cgiwrapper-nollmcrawler program" />
+ <meta name="Keywords" content="tipidee web cgi wrapper llm crawler protection s6-tcpserver-access" />
+ <!-- <link rel="stylesheet" type="text/css" href="//skarnet.org/default.css" /> -->
+ </head>
+<body>
+
+<p>
+<a href="index.html">tipidee</a><br />
+<a href="//skarnet.org/software/">Software</a><br />
+<a href="//skarnet.org/">skarnet.org</a>
+</p>
+
+<h1> The <tt>cgiwrapper-nollmcrawler</tt> program </h1>
+
+<p>
+ <tt>cgiwrapper-nollmcrawler</tt> is a very ad-hoc, quick-and-dirty protection
+against LLM crawler bots for installations that run tipidee under super-servers from
+<a href="//skarnet.org/software/s6-networking/">s6-networking</a>. tipidee servers
+cannot run an anti-crawler solution like
+<a href="https://anubis.techaro.lol/">Anubis</a> and need alternative protections.
+</p>
+
+<p>
+cgiwrapper-nollmcrawler is a chainloading program that you wrap your CGI program
+with. It takes a regular expression on the command line; if a new client connects
+to the server and hits the CGI program with a query string that matches the
+regular expression, the request is denied and the IP of the client is immediately
+blacklisted. Otherwise, the client is whitelisted and can hit any URL on the
+server.
+</p>
+
+<p>
+ This takes advantage of the propensity of LLM crawlers to hit servers from random
+IPs with random deep queries, while minimizing false positives from real users,
+who rarely make a deep query on their first visit.
+</p>
+
+<div id="interface">
+<h2> Interface </h2>
+</div>
+
+<p>
+ As a CGI program:
+</p>
+<pre>
+ cgiwrapper-nollmcrawler [ -f ] [ -v <em>verbosity</em> ] [ -d <em>depth</em> ] <em>rulesdir</em> <em>regex</em> <em>realcgi...</em>
+</pre>
+
+<ul>
+ <li> cgiwrapper-nollmcrawler expects to be run by tipideed as a CGI program,
+as a wrapper around <em>realcgi...</em>, which must also, obviously, be runnable
+as a CGI program. </li>
+ <li> It expects <em>rulesdir</em> to be the access rules directory given as argument
+to the <tt>-i</tt> option to
+<a href="//skarnet.org/software/s6-networking/s6-tcpserver-access.html">s6-tcpserver-access</a>
+on the tipidee command line. This directory must be writable by the user cgiwrapper-nollmcrawler
+is running as (so, typically, the user running the tipideed process). <em>rulesdir</em> must
+follow a specific format, see below. </li>
+ <li> When cgiwrapper-nollmcrawler is invoked, it first checks whether the client has previously
+been whitelisted in <em>rulesdir</em>. In that case, it execs into <em>realcgi...</em> immediately. </li>
+ <li> Then it checks the depth of the PATH_INFO variable against <em>depth</em>.
+If the contents of PATH_INFO have <em>depth</em> slashes (<tt>/</tt>) or fewer, the query is
+allowed and the client is whitelisted. </li>
+ <li> Then it checks the contents of the QUERY_STRING variable against <em>regex</em>. If
+the query string <em>matches</em>, then cgiwrapper-nollmcrawler blacklists the client in
+<em>rulesdir</em> and responds with a status 403 and an ungracious message. </li>
+ <li> If the query string does not match <em>regex</em>, then the client is whitelisted
+and cgiwrapper-nollmcrawler execs into <em>realcgi...</em>.
</li>
+</ul>
+
+<div id="accessrules-format">
+<h2> Access rules format </h2>
+</div>
+
+<ul>
+ <li> <tt><em>rulesdir</em>/ip4</tt> must exist if <em>rulesdir</em> performs access
+control for IPv4 addresses, and <tt><em>rulesdir</em>/ip6</tt> must exist if
+<em>rulesdir</em> performs access control for IPv6 addresses. This is the standard
+access rules directory structure. </li>
+ <li> The <tt><em>rulesdir</em>/outputs/allow/allow</tt> and
+<tt><em>rulesdir</em>/outputs/deny/deny</tt> files must also exist. They can be empty. </li>
+</ul>
+
+<p>
+ This permits the following implementation:
+</p>
+
+<ul>
+ <li> When cgiwrapper-nollmcrawler <em>whitelists</em> a client, it creates a symlink to
+<tt>../outputs/allow</tt>, named after the client's IP in the canonical
+<a href="//skarnet.org/software/s6-networking/s6-tcpserver-access.html">s6-tcpserver-access</a>
+format, in either <tt><em>rulesdir</em>/ip4</tt> or
+<tt><em>rulesdir</em>/ip6</tt>. </li>
+ <li> When cgiwrapper-nollmcrawler <em>blacklists</em> a client, it creates a symlink to
+<tt>../outputs/deny</tt> instead. </li>
+ <li> This ensures each entry only uses one inode, and as little room as possible. </li>
+</ul>
+
+<p>
+ LLM crawler bots are ruthless and can attack from <em>millions</em> of IPs, which is why
+efficiency is important. Implementing a ban with just a <tt>symlink()</tt> is efficient.
+</p>
+
+<div id="commonusage">
+<h2> Common usage </h2>
+</div>
+
+<ul>
+ <li> cgiwrapper-nollmcrawler expects to be run by tipideed as a CGI program,
+as a wrapper around <em>realcgi...</em>. </li>
+ <li> e.g. if the URL you want to protect is <tt>https://example.com/cgit.cgi</tt>,
+and <tt>cgit.cgi</tt> is a direct cgit binary, then the way to protect it is:
+ <ul>
+ <li> Move <tt>cgit.cgi</tt> to <tt>cgit.cgi-real</tt> and <em>never link this resource anywhere</em>.
</li>
+ <li> Write a script (shell, execline, whatever language you want) standing in for <tt>cgit.cgi</tt>
+that execs into cgiwrapper-nollmcrawler with <tt>cgit.cgi-real</tt> as its last argument. Make it executable. </li>
+ </ul> </li>
+ <li> cgiwrapper-nollmcrawler is typically used to protect cgit, but it can
+protect any backend that uses CGI as its interface and has deep URLs with easily
+identifiable query strings. </li>
+</ul>
+
+<div id="exitcodes">
+<h2> Exit codes </h2>
+</div>
+
+<dl>
+ <dt> 0 </dt> <dd> Success. </dd>
+ <dt> 100 </dt> <dd> Bad usage. </dd>
+ <dt> 111 </dt> <dd> System call failed. This usually signals an issue with the
+underlying operating system. </dd>
+</dl>
+
+<div id="options">
+<h2> Options </h2>
+</div>
+
+<dl>
+ <dt> -4 </dt>
+ <dd> Expect IPv4 addresses. Use this option when reading logs from a server listening
+to an IPv4 address. </dd>
+
+ <dt> -6 </dt>
+ <dd> Expect IPv6 addresses. Use this option when reading logs from a server listening
+to an IPv6 address. </dd>
+</dl>
+
+<div id="notes">
+<h2> Notes </h2>
+</div>
+
+<ul>
+ <li> This <a href="https://social.treehouse.systems/@ska/115384879517972291">Fediverse
+thread</a> tells the story of how cgiwrapper-nollmcrawler came to be, and how it was
+deployed on skarnet.org. </li>
+</ul>
+
+</body>
+</html>
diff --git a/doc/index.html b/doc/index.html
index 2b1ce4d..ea7e18b 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -176,16 +176,20 @@ the previous versions of tipidee and the current one.
</li>
 <ul>
 <li><a href="tipideed.html">The <tt>tipideed</tt> program</a></li>
-<li><a href="tipidee-config.html">The <tt>tipidee-config</tt> program</a></li>
+<li><a href="tipidee-config.html">The <tt>tipidee-config</tt> program</a></li> <br>
+
+<li><a href="ls.cgi.html">The <tt>ls.cgi</tt> internal program</a></li>
+<li><a href="tipidee-logaggregate.html">The <tt>tipidee-logaggregate</tt> internal program</a></li>
+<li><a href="cgiwrapper-nollmcrawler.html">The <tt>cgiwrapper-nollmcrawler</tt> internal program</a></li>
 </ul>
 <h3> Internal commands </h3>
 <ul>
 <li><a href="tipidee-config-preprocess.html">The <tt>tipidee-config-preprocess</tt> internal program</a></li>
-<li><a href="ls.cgi.html">The <tt>ls.cgi</tt> internal program</a></li>
 </ul>
+
 <h3> Configuration format </h3>
 <ul>
diff --git a/doc/tipidee-logaggregate.html b/doc/tipidee-logaggregate.html
new file mode 100644
index 0000000..7bea86b
--- /dev/null
+++ b/doc/tipidee-logaggregate.html
@@ -0,0 +1,108 @@
+<html>
+ <head>
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+ <meta http-equiv="Content-Language" content="en" />
+ <title>tipidee: the tipidee-logaggregate program</title>
+ <meta name="Description" content="tipidee: the tipidee-logaggregate program" />
+ <meta name="Keywords" content="tipidee web server log aggregator http skarnet.org skarnet software httpd" />
+ <!-- <link rel="stylesheet" type="text/css" href="//skarnet.org/default.css" /> -->
+ </head>
+<body>
+
+<p>
+<a href="index.html">tipidee</a><br />
+<a href="//skarnet.org/software/">Software</a><br />
+<a href="//skarnet.org/">skarnet.org</a>
+</p>
+
+<h1> The <tt>tipidee-logaggregate</tt> program </h1>
+
+<p>
+ <tt>tipidee-logaggregate</tt> is a very ad-hoc, quick-and-dirty log aggregator
+for tipidee.
+</p>
+
+<div id="interface">
+<h2> Interface </h2>
+</div>
+
+<pre>
+ tipidee-logaggregate [ -4 | -6 ]
+</pre>
+
+<ul>
+ <li> tipidee-logaggregate reads a series of log entries on its stdin. </li>
+ <li> It aggregates the logs, and prints what it finds to stdout. For every
+client IP that hit the server, it prints that IP, followed by all the URLs
+that the client requested. </li>
+</ul>
+
+<div id="log-format">
+<h2> Log format </h2>
+</div>
+
+<p>
+ tipidee-logaggregate was written for a very specific situation and is only
+provided as a convenience. No effort has been made to try and make it generic,
+so it expects a precise log format:
+</p>
+
+<ul>
+ <li> The <tt>log</tt> directive in <tt>/etc/tipidee.conf</tt> must contain at
+least the following: <tt>log start ip request resource</tt> </li>
+ <li> The log lines must start with a TAI64N label. This is achieved by running
+<a href="//skarnet.org/software/s6/s6-log.html">s6-log</a> as the logging program
+with the <strong><tt>t</tt></strong> directive. </li>
+</ul>
+
+<p>
+ If these conditions are not met, tipidee-logaggregate will not work properly.
+</p>
+
+<div id="commonusage">
+<h2> Common usage </h2>
+</div>
+
+<p>
+<code> cat *.s current | tipidee-logaggregate > result </code>
+</p>
+
+<div id="exitcodes">
+<h2> Exit codes </h2>
+</div>
+
+<dl>
+ <dt> 0 </dt> <dd> Success. </dd>
+ <dt> 100 </dt> <dd> Bad usage. </dd>
+ <dt> 111 </dt> <dd> System call failed. This usually signals an issue with the
+underlying operating system. </dd>
+</dl>
+
+<div id="options">
+<h2> Options </h2>
+</div>
+
+<dl>
+ <dt> -4 </dt>
+ <dd> Expect IPv4 addresses. Use this option when reading logs from a server listening
+to an IPv4 address. </dd>
+
+ <dt> -6 </dt>
+ <dd> Expect IPv6 addresses. Use this option when reading logs from a server listening
+to an IPv6 address.
</dd>
+</dl>
+
+<div id="notes">
+<h2> Notes </h2>
+</div>
+
+<ul>
+ <li> If you feed tipidee-logaggregate logs starting from a random moment in time
+when tipideed has already been serving, some warnings are normal and expected.
+They correspond to the already-connected clients that tipidee-logaggregate cannot
+identify. Unless they repeat for a large number of lines, these warnings are harmless. </li>
+</ul>
+
+</body>
+</html>
diff --git a/src/misc/cgiwrapper-nollmcrawler.c b/src/misc/cgiwrapper-nollmcrawler.c
index 95a2ea5..dc566a5 100644
--- a/src/misc/cgiwrapper-nollmcrawler.c
+++ b/src/misc/cgiwrapper-nollmcrawler.c
@@ -16,7 +16,7 @@
 #include <skalibs/fmtscan.h>
 #include <skalibs/exec.h>
-#define USAGE "as a CGI script: cgiwrapper-nollmcrawler [ -v verbosity ] [ -d pathinfodepth ] rulesdir badregex realcgit..."
+#define USAGE "as a CGI script: cgiwrapper-nollmcrawler [ -v verbosity ] [ -d pathinfodepth ] rulesdir badregex realcgi..."
 #define dieusage() strerr_dieusage(100, USAGE)
 enum golb_e
diff --git a/src/misc/tipidee-logaggregate.c b/src/misc/tipidee-logaggregate.c
index 7b61579..57a58bd 100644
--- a/src/misc/tipidee-logaggregate.c
+++ b/src/misc/tipidee-logaggregate.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 #include <errno.h>
 #include <limits.h>
+#include <unistd.h>
 #include <skalibs/uint64.h>
 #include <skalibs/bytestr.h>
@@ -442,5 +443,5 @@ int main (int argc, char const *const *argv)
 buffer_putsflush(buffer_1, " ips\n") ;
 }
 (void)avltree_iter(&ipinfo_map, &print_iter, 0) ;
- return 0 ;
+ _exit(0) ;
 }
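Illustration (not part of the commit): the decision steps and the symlink-based rule entries documented in doc/cgiwrapper-nollmcrawler.html can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the code in src/misc/cgiwrapper-nollmcrawler.c. The helper names `pathinfo_depth`, `decide`, `ip4_rule_path` and `record_ip4` are hypothetical, the use of POSIX extended regular expressions is an assumption, and so is the `a.b.c.d_32` naming of a single-address IPv4 entry under `rulesdir/ip4`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <unistd.h>

/* Hypothetical helper: number of '/' characters in PATH_INFO (0 if unset). */
size_t pathinfo_depth (char const *s)
{
  size_t n = 0 ;
  if (!s) return 0 ;
  while ((s = strchr(s, '/'))) { n++ ; s++ ; }
  return n ;
}

/* Hypothetical helper mirroring the documented order of checks.
   Returns 1 to whitelist, 0 to blacklist, -1 if the regex does not compile.
   Assumption: the regex is a POSIX extended regular expression. */
int decide (char const *pathinfo, char const *query, size_t depth, char const *regex)
{
  regex_t re ;
  int r ;
  if (pathinfo_depth(pathinfo) <= depth) return 1 ;  /* shallow request: allowed */
  if (regcomp(&re, regex, REG_EXTENDED | REG_NOSUB)) return -1 ;
  r = regexec(&re, query ? query : "", 0, 0, 0) ;
  regfree(&re) ;
  return r == REG_NOMATCH ;  /* a matching query string means a ban */
}

/* Hypothetical helper: path of the rule entry for one IPv4 address.
   Assumption: single-address entries are named "a.b.c.d_32". */
int ip4_rule_path (char *buf, size_t len, char const *rulesdir, char const *ip)
{
  int r ;
  assert(buf && rulesdir && ip) ;
  r = snprintf(buf, len, "%s/ip4/%s_32", rulesdir, ip) ;
  return (r < 0 || (size_t)r >= len) ? -1 : 0 ;
}

/* Record a verdict with a single symlink(), as the documentation describes:
   the entry points to ../outputs/allow or ../outputs/deny. */
int record_ip4 (char const *rulesdir, char const *ip, int allow)
{
  char path[4096] ;
  if (ip4_rule_path(path, sizeof(path), rulesdir, ip) < 0) return -1 ;
  return symlink(allow ? "../outputs/allow" : "../outputs/deny", path) ;
}
```

In this sketch, a wrapper would call `decide()` with the PATH_INFO and QUERY_STRING environment variables, then `record_ip4()` with the verdict before either execing into the real CGI program or answering 403. The relative symlink target resolves to `rulesdir/outputs/allow` or `rulesdir/outputs/deny`, which is why those directories must exist beforehand.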
