第58节 lychee 实现 docs 检查


❤️💕💕记录sealosopen in new window开源项目的学习过程。k8s,docker和云原生的学习open in new window。Myblog:http://nsddd.topopen in new window


[TOC]

介绍

什么是 lychee?

⚡ 用Rust编写的快速、异步、基于流的链接检查器。

在Markdown、HTML、reStructuredText或任何其他文本文件或网站中查找损坏的超链接和邮件地址!

可作为命令行实用程序、库和 GitHub Action使用。

安装

在基于APT/dpkg的Linux发行版(例如Debian、Ubuntu、Linux Mint和Kali Linux),以下命令将安装所有必需的构建依赖项,包括Rust工具链和 cargo

curl -sSf 'https://sh.rustup.rs' | sh
apt install gcc pkg-config libc6-dev libssl-dev

使用

递归检查当前目录中支持的文件中的所有链接

lychee .

您还可以指定各种类型的输入:

# check links in specific local file(s):
lychee README.md
lychee test.html info.txt

# check links on a website:
lychee https://endler.dev

# check links in directory but block network requests
lychee --offline path/to/directory

# check links in a remote file:
lychee https://raw.githubusercontent.com/lycheeverse/lychee/master/README.md

# check links in local files via shell glob:
lychee ~/projects/*/README.md

# check links in local files (lychee supports advanced globbing and ~ expansion):
lychee "~/projects/big_project/**/README.*"

# ignore case when globbing and check result for each link:
lychee --glob-ignore-case --verbose "~/projects/**/[r]eadme.*"

# check links from epub file (requires atool: https://www.nongnu.org/atool)
acat -F zip {file.epub} "*.xhtml" "*.html" | lychee -

Lychee将其他文件格式解析为明文,并使用linkify提取链接。如果没有格式或编码细节,这通常工作得很好,但如果您需要对新文件格式的专门支持,请考虑创建一个问题。

Docker Usage

下面是如何将本地目录挂载到容器中,并使用lychee检查一些输入。传递 --init 参数,以便可以从终端停止荔枝。我们还传递 -it 来启动一个交互式终端,这是显示进度条所必需的。

docker run --init -it -v `pwd`:/input lycheeverse/lychee /input/README.md

GitHub Token

为了避免在检查GitHub链接时受到速率限制,您可以选择使用Github令牌设置环境变量,如 GITHUB_TOKEN=xxxx ,或使用 --github-token CLI选项。它也可以在配置文件中设置。 下面是一个示例配置文件。

令牌可以在您的GitHub帐户设置页面中生成。没有额外权限的个人令牌足以检查公共存储库链接。

Commandline Parameter参数

有大量命令行参数可用于自定义行为。下面是一个完整的列表。

A fast, async link checker

Finds broken URLs and mail addresses inside Markdown, HTML, `reStructuredText`, websites and more!

Usage: lychee [OPTIONS] <inputs>...

Arguments:
  <inputs>...
          The inputs (where to get links to check from). These can be: files (e.g. `README.md`), file globs (e.g. `"~/git/*/README.md"`), remote URLs (e.g. `https://example.com/README.md`) or standard input (`-`). NOTE: Use `--` to separate inputs from options that allow multiple arguments

Options:
  -c, --config <CONFIG_FILE>
          Configuration file to use
          
          [default: lychee.toml]

  -v, --verbose...
          Set verbosity level; more output per occurrence (e.g. `-v` or `-vv`)

  -q, --quiet...
          Less output per occurrence (e.g. `-q` or `-qq`)

  -n, --no-progress
          Do not show progress bar.
          This is recommended for non-interactive shells (e.g. for continuous integration)

      --cache
          Use request cache stored on disk at `.lycheecache`

      --max-cache-age <MAX_CACHE_AGE>
          Discard all cached requests older than this duration
          
          [default: 1d]

      --dump
          Don't perform any link checking. Instead, dump all the links extracted from inputs that would be checked

      --archive <ARCHIVE>
          Specify the use of a specific web archive. Can be used in combination with `--suggest`
          
          [possible values: wayback]

      --suggest
          Suggest link replacements for broken links, using a web archive. The web archive can be specified with `--archive`

  -m, --max-redirects <MAX_REDIRECTS>
          Maximum number of allowed redirects
          
          [default: 5]

      --max-retries <MAX_RETRIES>
          Maximum number of retries per request
          
          [default: 3]

      --max-concurrency <MAX_CONCURRENCY>
          Maximum number of concurrent network requests
          
          [default: 128]

  -T, --threads <THREADS>
          Number of threads to utilize. Defaults to number of cores available to the system

  -u, --user-agent <USER_AGENT>
          User agent
          
          [default: lychee/0.12.0]

  -i, --insecure
          Proceed for server connections considered insecure (invalid TLS)

  -s, --scheme <SCHEME>
          Only test links with the given schemes (e.g. http and https)

      --offline
          Only check local files and block network requests

      --include <INCLUDE>
          URLs to check (supports regex). Has preference over all excludes

      --exclude <EXCLUDE>
          Exclude URLs and mail addresses from checking (supports regex)

      --exclude-file <EXCLUDE_FILE>
          Deprecated; use `--exclude-path` instead

      --exclude-path <EXCLUDE_PATH>
          Exclude file path from getting checked

  -E, --exclude-all-private
          Exclude all private IPs from checking.
          Equivalent to `--exclude-private --exclude-link-local --exclude-loopback`

      --exclude-private
          Exclude private IP address ranges from checking

      --exclude-link-local
          Exclude link-local IP address range from checking

      --exclude-loopback
          Exclude loopback IP address range and localhost from checking

      --exclude-mail
          Exclude all mail addresses from checking

      --remap <REMAP>
          Remap URI matching pattern to different URI

      --header <HEADER>
          Custom request header

  -a, --accept <ACCEPT>
          Comma-separated list of accepted status codes for valid links

  -t, --timeout <TIMEOUT>
          Website timeout in seconds from connect to response finished
          
          [default: 20]

  -r, --retry-wait-time <RETRY_WAIT_TIME>
          Minimum wait time in seconds between retries of failed requests
          
          [default: 1]

  -X, --method <METHOD>
          Request method
          
          [default: get]

  -b, --base <BASE>
          Base URL or website root directory to check relative URLs e.g. https://example.com or `/path/to/public`

      --basic-auth <BASIC_AUTH>
          Basic authentication support. E.g. `username:password`

      --github-token <GITHUB_TOKEN>
          GitHub API token to use when checking github.com links, to avoid rate limiting
          
          [env: GITHUB_TOKEN]

      --skip-missing
          Skip missing input files (default is to error if they don't exist)

      --include-verbatim
          Find links in verbatim sections like `pre`- and `code` blocks

      --glob-ignore-case
          Ignore case when expanding filesystem path glob inputs

  -o, --output <OUTPUT>
          Output file of status report

  -f, --format <FORMAT>
          Output format of final status report (compact, detailed, json, markdown)
          
          [default: compact]

      --require-https
          When HTTPS is available, treat HTTP links as errors

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

错误码 (Exit codes)

  • 0 表示成功(已成功检查所有链接或已按照配置排除/跳过所有链接)
  • 1 用于丢失的输入和任何意外的运行时失败或配置错误
  • 2 表示链路检查失败(如果任何未排除的链路未通过检查)

您可以通过使用 --exclude 指定正则表达式模式(例如, --exclude example\.(com|org) )。如果当前工作目录中存在名为 .lycheeignore 的文件,则其内容也将被排除。该文件允许您列出多个要排除的正则表达式(每行一个模式)。

要从扫描中排除文件/目录,请使用 lychee.tomlexclude_path

exclude_path = ["some/path", "*/dev/*"]

Caching

如果设置了 --cache 标志,荔枝将在当前目录中名为 .lycheecache 的文件中缓存响应。如果文件存在并且设置了标志,则在启动时将加载该高速缓存。这可以大大加快未来的运行速度。请注意,默认情况下,荔枝不会在磁盘上存储任何数据。

用作自己的库

使用 rust 调用~

actions

使用荔枝快速检查Markdown、HTML和文本文件中的链接。

可以融合 issue 的使用,比如说:

与“从文件创建问题”一起使用时,当操作发现链接问题时,将打开 issue

Note 当然,这需要你的权限

Usage

以下是GitHub工作流文件的完整示例:

它将每天检查一次所有存储库链接,并在出现错误时创建一个问题。将此保存在 .github/workflows/links.yml 下:

name: Links

on:
  repository_dispatch:
  workflow_dispatch:
  schedule:
    - cron: "00 18 * * *"

jobs:
  linkChecker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Link Checker
        id: lychee
        uses: lycheeverse/lychee-action@v1.7.0
        env:
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

      - name: Create Issue From File
        if: env.lychee_exit_code != 0
        uses: peter-evans/create-issue-from-file@v4
        with:
          title: Link Checker Report
          content-filepath: ./lychee/out.md
          labels: report, automated issue

(You不需要自己配置 GITHUB_TOKEN ;由Github自动设置)。

如果您总是希望使用最新的功能,但又避免破坏性的更改,则可以将版本替换为 lycheeverse/lychee-action@v1

注意

可以对照这个加一些参数,比如 -E 屏蔽所有 localhost 和内网ip的检查,以及忽略401和403的错误,429那个是GitHub的限速报错(可加可不加)

image-20230513195026852

Alternative approach (替代方法)

这将在任何git push事件和所有pull请求期间检查所有存储库链接。如果出现错误,操作将失败。

这样做的好处是确保在合并请求期间,不会添加任何断开的链接,并且如果任何现有链接断开,它们将被捕获。将其保存在 .github/workflows/links-fail-fast.yml 下:

name: Links (Fail Fast)

on:
  push:
  pull_request:

jobs:
  linkChecker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Link Checker
        uses: lycheeverse/lychee-action@v1.7.0
        with:
          fail: true
        env:
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

标志

Check Linksopen in new window

END 链接