An NGINX Configuration for Serving a Modern PyPI Mirror

Harry Chen’s Blog (Shengqi Chen (i@harrychen.xyz))

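As everyone in the mirror-site world knows, PyPI is hard to please: it eats huge amounts of disk space, generates massive traffic, updates constantly, and comes with an unreliable sync tool, bandersnatch. Why unreliable? Have you ever heard of any other sync tool that is unable to delete files the upstream has already removed?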

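Fortunately, last year the great taoky wrote a sane sync tool, shadowmire. One of its key features is that it removes versions and related files that have disappeared upstream, which has saved mirror sites on China's education network a lot of disk space and also reduced the risk of supply-chain attacks.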

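However, offering a PyPI mirror that users can actually use is still not that simple. For any HTTP site to count as a valid Python repository, it has to follow the Simple repository API, also known as the Index API. Specifically, there are two kinds of endpoints: one lists all projects (packages) in the repository, and the other gives the details of a single package. Each endpoint has to be available in both JSON and HTML, and the server decides which format to return based on the client's Accept request header. These rules did not appear all at once: PEP 503 first defined the "simple" format (Simple API), and PEP 691 later added JSON support on top of it, which gives us the current standard.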

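PyPI itself is more complicated still: it also offers an XML-RPC API and a more modern JSON API. Tools such as pip or uv do not use these APIs, but mirror sync tools need them to fetch package metadata (most importantly, the last-updated time). So a mirror can serve them too, as far as it is able.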

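Mirror sites generally serve static files only, so the sync tool and NGINX have to work together to meet the requirements above. For a package foo, Shadowmire currently generates the following directory layout: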

packages/ # the actual wheel files, omitted here
simple/ # Simple API
- index.html -> index.v1_html
- index.v1_html
- index.v1_json
- foo/ # subdirectory for package foo
  - index.html -> index.v1_html
  - index.v1_html
  - index.v1_json
- .../
json/ # JSON API
- foo # JSON API content for package foo (if it can be fetched from upstream)

First, for the Simple API directory, NGINX has to return different files depending on the Accept header of the HTTP request:

# place outside the server block
map $http_accept $pypi_mirror_suffix {
    default ".html";
    "~*application/vnd\.pypi\.simple\.v1\+json" ".v1_json";
    "~*application/vnd\.pypi\.simple\.v1\+html" ".v1_html";
    "~*text/html" ".html";
}

# place inside the server block
# /pypi/web/simple/ is the path prefix TUNA has long used; change it as needed
location ~ ^/pypi/web/simple/[^/]* { # match simple/ and simple/foo
    index index$pypi_mirror_suffix index.html;
    types {
        application/vnd.pypi.simple.v1+json v1_json;
        application/vnd.pypi.simple.v1+html v1_html;
        text/html html;
    }
    default_type "text/html";
    try_files $uri$pypi_mirror_suffix $uri $uri/ =404;
}
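To make the selection logic easier to reason about, here is a small JavaScript model of the map block. It is only a sketch for illustration, not part of the NGINX configuration: the function name pickSuffix and the sample Accept header (modeled on what pip-like clients send) are assumptions. Like NGINX's map, it tries the regex entries in order of appearance and takes the first match, which also means q-values in the header are ignored.

```javascript
// Illustrative model of `map $http_accept $pypi_mirror_suffix`.
// Regex entries are tried in order of appearance; the first match wins.
const SUFFIX_RULES = [
  [/application\/vnd\.pypi\.simple\.v1\+json/i, ".v1_json"],
  [/application\/vnd\.pypi\.simple\.v1\+html/i, ".v1_html"],
  [/text\/html/i, ".html"],
];

// Hypothetical helper: returns the file suffix NGINX would pick for an Accept header.
function pickSuffix(accept) {
  for (const [pattern, suffix] of SUFFIX_RULES) {
    if (pattern.test(accept || "")) return suffix;
  }
  return ".html"; // corresponds to the map's `default`
}

// Sample header, assumed for illustration.
const pipLikeAccept =
  "application/vnd.pypi.simple.v1+json, " +
  "application/vnd.pypi.simple.v1+html; q=0.1, text/html; q=0.01";

console.log(pickSuffix(pipLikeAccept)); // .v1_json
console.log(pickSuffix("text/html"));   // .html
console.log(pickSuffix("*/*"));         // .html
```

With the suffix chosen, try_files and index then look for index.v1_json, index.v1_html or index.html in the requested directory.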

Next, for the JSON API (/pypi/<pkg_name>/json), we need to swap the package name and the json segment in the path so that it fits the directory layout Shadowmire downloads into (/pypi/web/json/<pkg_name>):

# PyPI JSON API: https://warehouse.pypa.io/api-reference/json.html
# pattern: /pypi/<package_name>/json
location ~ ^/pypi/[^/]+/json$ {
    rewrite ^/pypi/([^/]+)/json$ /pypi/web/json/$1 break;
    types { }
    default_type "application/json; charset=utf-8";
}
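The effect of the rewrite can be sketched in a few lines of JavaScript. This is an illustrative model only; the function name rewriteJsonApi is an assumption and does not appear in the real configuration.

```javascript
// Illustrative model of the rewrite rule for the JSON API.
// Returns the path NGINX would serve from, or null when the URI
// does not have the /pypi/<package_name>/json shape.
function rewriteJsonApi(uri) {
  const m = /^\/pypi\/([^/]+)\/json$/.exec(uri);
  return m ? `/pypi/web/json/${m[1]}` : null;
}

console.log(rewriteJsonApi("/pypi/requests/json"));        // /pypi/web/json/requests
console.log(rewriteJsonApi("/pypi/requests/2.31.0/json")); // null
```

The second call shows that only the two-segment form is covered by this location; any other shape is left to other rules.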

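However, we are not done yet. PyPI additionally requires every package name in a path to be normalized: uppercase letters are converted to lowercase, and every run of the characters -, _ and . is replaced with a single hyphen -. Shadowmire handles this automatically when downloading, but the NGINX configuration also has to make sure that the paths users request are normalized correctly. This is done with an njs script that converts the paths: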

// pypi.njs

function canonicalizeName(n) {
  // PEP 503 normalization: lowercase, then collapse every run of
  // `-`, `_` and `.` into a single `-`.
  // A regex `replace` is used because njs < 0.7.10 does not have `String.replaceAll`.
  return n.toLowerCase().replace(/[-_.]+/g, '-');
}

/// <reference path="ngx_http_js_module.d.ts" />
/**
 * @param {NginxHTTPRequest} r
 */
function redirectToCanonicalizedName(r) {

  const uri = r.uri.trim();
  r.log(`pypi.njs: original URI to canonicalize: ${uri}`);

  const parts = uri.split('/');
  let matched = false;

  if (parts.length >= 5) {
    // match `/pypi/web/simple/<pkg_name>`
    if (parts[1] === 'pypi' && parts[2] === 'web' && parts[3] === 'simple') {
      parts[4] = canonicalizeName(parts[4]);
      matched = true;
    }
  } 
  
  if (!matched && parts.length >= 4) {
    // match `/pypi/<pkg_name>/json`
    if (parts[1] === 'pypi' && parts[3] === 'json') {
      parts[2] = canonicalizeName(parts[2]);
      matched = true;
    }
  }

  if (!matched) {
    r.warn(`pypi.njs: unknown redirection for URL ${uri}`);
    r.return(500);
  } else {
    const newUri = parts.join('/');
    r.log(`pypi.njs: redirecting to new URI: ${newUri}`);
    r.return(302, newUri);
  }

}

export default { redirectToCanonicalizedName }
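The path handling in the script can be tried outside NGINX with a pure-function version. The following is a sketch under stated assumptions: canonicalizeUri is a hypothetical helper that returns the redirect target instead of calling r.return, and canonicalizeName here follows the PEP 503 rule of collapsing runs of -, _ and . into one hyphen.

```javascript
// PEP 503 style normalization: lowercase, runs of -, _ and . become one hyphen.
function canonicalizeName(name) {
  return name.toLowerCase().replace(/[-_.]+/g, "-");
}

// Hypothetical pure version of the handler: returns the redirect target,
// or null when the URI has neither of the two supported shapes.
function canonicalizeUri(uri) {
  const parts = uri.split("/");
  // shape 1: /pypi/web/simple/<pkg_name>[/...]
  if (parts.length >= 5 && parts[1] === "pypi" &&
      parts[2] === "web" && parts[3] === "simple") {
    parts[4] = canonicalizeName(parts[4]);
    return parts.join("/");
  }
  // shape 2: /pypi/<pkg_name>/json
  if (parts.length >= 4 && parts[1] === "pypi" && parts[3] === "json") {
    parts[2] = canonicalizeName(parts[2]);
    return parts.join("/");
  }
  return null;
}

console.log(canonicalizeUri("/pypi/web/simple/Flask_SQLAlchemy/")); // /pypi/web/simple/flask-sqlalchemy/
console.log(canonicalizeUri("/pypi/Django/json"));                  // /pypi/django/json
console.log(canonicalizeUri("/other/path"));                        // null
```

In the real handler the first two results become the Location of a 302 response, and the null case corresponds to the 500 branch.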

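Next, extend the NGINX configuration to handle these paths. NGINX tries regex locations in order of appearance and uses the first match, so this block has to be placed before the configuration above in order to take precedence: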

js_import pypi from pypi.njs;
# explicitly disable canonicalization of these URLs with dots
location = /pypi/web/simple/index.html {}
location = /pypi/web/simple/index.v1_html {}
location = /pypi/web/simple/index.v1_json {}
# match URLs with unnormalized names and hand them over to njs
location ~ ^/pypi/web/simple/[^/]*[A-Z_.][^/]* {
    js_content pypi.redirectToCanonicalizedName;
}
location ~ ^/pypi/[^/]*[A-Z_.][^/]*/json {
    js_content pypi.redirectToCanonicalizedName;
}

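To reduce the number of JS invocations, these rules only match "unnormalized" URLs, that is, those containing an uppercase letter, an underscore or a dot. Written this way, however, they would also match URLs such as /pypi/web/simple/index.html and cause unexpected errors. For example, when pip requests /simple/ in JSON format, NGINX first rewrites it to /simple/index.v1_json, which is then caught by the rules above and rewritten to /simple/index-v1-json, ending in a 404. Since these are currently the only file names under simple/ that can be accessed, we can use exact-match location = blocks to explicitly exempt these URLs from the special handling and avoid collateral damage.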
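The misfire described here is easy to reproduce with the same regular expression in JavaScript. This is only an illustrative model: goesToNjs is a hypothetical helper, and the exemption check stands in for NGINX's behavior of testing exact-match locations before any regex location.

```javascript
// The first regex location from the configuration, as a JavaScript RegExp.
const UNNORMALIZED_SIMPLE = /^\/pypi\/web\/simple\/[^/]*[A-Z_.][^/]*/;

// The three URIs exempted by the `location =` blocks.
const EXEMPT = new Set([
  "/pypi/web/simple/index.html",
  "/pypi/web/simple/index.v1_html",
  "/pypi/web/simple/index.v1_json",
]);

// Hypothetical helper: true when a URI would be handed to the njs handler.
function goesToNjs(uri, withExemptions) {
  // An exact-match location ends the search before regexes are tried.
  if (withExemptions && EXEMPT.has(uri)) return false;
  return UNNORMALIZED_SIMPLE.test(uri);
}

console.log(goesToNjs("/pypi/web/simple/index.v1_json", false)); // true
console.log(goesToNjs("/pypi/web/simple/index.v1_json", true));  // false
console.log(goesToNjs("/pypi/web/simple/Flask/", true));         // true
console.log(goesToNjs("/pypi/web/simple/flask/index.v1_json", true)); // false
```

The last call shows why only the top-level index files need exemptions: the pattern inspects just the first path segment after simple/, so index files inside a package directory never match.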

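Finally, for the sake of user experience (yes, my Chrome has frozen on these pages several times) and to reduce server load, we can add a rule that blocks browsers from some paths (again, it has to be inserted before the rules above so that it takes effect first):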

# outside the server block
map $http_user_agent $is_browser {
  default 0;
  "~*validation server" 0;
  "~*mozilla" 1;
}

# inside the server block
# disable browser viewing of some too large directories / files in PyPI
location ~ ^/pypi/web/simple/(index\.(html|v1_html)|json/|pypi/)$ {
    default_type 'text/html';
    if ($is_browser) {
        return 413 "This page is too large for browsers.";
    }
}
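The browser check can be modeled the same way. This is a sketch for illustration: isBrowser and shouldReject are hypothetical names, the sample User-Agent strings are made up, and the model covers only this one location block in isolation.

```javascript
// Illustrative model of `map $http_user_agent $is_browser`.
// Regex entries are tried in order of appearance; the first match wins.
const UA_RULES = [
  [/validation server/i, false],
  [/mozilla/i, true],
];

function isBrowser(userAgent) {
  for (const [pattern, verdict] of UA_RULES) {
    if (pattern.test(userAgent || "")) return verdict;
  }
  return false; // corresponds to `default 0`
}

// The location regex from the block above.
const BLOCKED = /^\/pypi\/web\/simple\/(index\.(html|v1_html)|json\/|pypi\/)$/;

// True when this location would answer with 413.
function shouldReject(uri, userAgent) {
  return BLOCKED.test(uri) && isBrowser(userAgent);
}

const chrome = "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0"; // made-up sample
const pip = "pip/24.0";                                        // made-up sample

console.log(shouldReject("/pypi/web/simple/index.html", chrome)); // true
console.log(shouldReject("/pypi/web/simple/index.html", pip));    // false
console.log(shouldReject("/pypi/web/simple/index.v1_json", chrome)); // false
```

Because the "validation server" entry comes first, a client whose User-Agent contains both that string and "Mozilla" is still treated as a non-browser.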

With that, we can finally offer a reasonably complete "modern" PyPI mirror service. If you are also trying to set up a PyPI mirror, I hope this helps.

