Re: Factor

Factor: the language, the theory, and the practice.

Offline Wikipedia

Tuesday, May 16, 2023

#parsing #web #wikipedia

Pretty much everyone agrees that Wikipedia is awesome (except maybe during one of their controversial fundraising campaigns). In addition to Wikipedia, the Wikimedia Foundation operates a number of sister projects, including Wiktionary, Wikibooks, Wikisource, Wikidata, and Wikimedia Commons.

Even though the official Wikipedia iOS app and Wikipedia Android app are both great, they still require internet access to be useful. I am not alone in wondering how to build your own Hitchhiker’s Guide with Wikipedia, or in looking through the options for downloading a Wikipedia database.

One way to do this is to implement support for the ZIM file format, for example using the libzim project. Many archives are available for download as ZIM files, covering Wikipedia, various popular websites like StackOverflow and Project Gutenberg, and even some open source projects. You can also build your own ZIM file if you want to archive custom content.

ZIM stands for “Zeno IMproved”, as it replaces the earlier Zeno file format. Its file compression uses LZMA2, as implemented by the xz-utils library, and, more recently, Zstandard. The openZIM project is sponsored by Wikimedia CH, and supported by the Wikimedia Foundation.

Let’s implement this using Factor!

Each ZIM file starts with a header in little endian format:

PACKED-STRUCT: zim-header
    { magic-number uint32_t }
    { major-version uint16_t }
    { minor-version uint16_t }
    { uuid uint64_t[2] }
    { entry-count uint32_t }
    { cluster-count uint32_t }
    { url-ptr-pos uint64_t }
    { title-ptr-pos uint64_t }
    { cluster-ptr-pos uint64_t }
    { mime-list-ptr-pos uint64_t }
    { main-page uint32_t }
    { layout-page uint32_t }
    { checksum-pos uint64_t } ;
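As a cross-language sketch of the same layout, the header can be read with Python's `struct` module (field names and the `parse_zim_header` helper are my own; the layout mirrors the `zim-header` struct above, 80 bytes in total):

```python
import struct

# Little-endian layout mirroring the zim-header struct above (80 bytes).
ZIM_HEADER = struct.Struct("<IHH16sIIQQQQIIQ")
FIELDS = ("magic_number", "major_version", "minor_version", "uuid",
          "entry_count", "cluster_count", "url_ptr_pos", "title_ptr_pos",
          "cluster_ptr_pos", "mime_list_ptr_pos", "main_page",
          "layout_page", "checksum_pos")

def parse_zim_header(data: bytes) -> dict:
    header = dict(zip(FIELDS, ZIM_HEADER.unpack(data[:ZIM_HEADER.size])))
    assert header["magic_number"] == 0x44D495A, "not a ZIM file"
    return header
```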

In addition to 16-bit, 32-bit, and 64-bit little-endian numbers, we need to be able to read null-terminated strings, typically stored as UTF-8. For example, when reading the mime-type list:

: read-string ( -- str )
    { 0 } read-until 0 assert= utf8 decode ;

: read-mime-types ( -- seq )
    [ read-string dup empty? not ] [ ] produce nip ;
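The same idea in Python, for comparison (function names are my own): read bytes up to a NUL terminator and decode as UTF-8, and keep reading strings until an empty one marks the end of the mime-type list.

```python
def read_string(f) -> str:
    """Read bytes up to (and consuming) a NUL terminator, decode as UTF-8."""
    out = bytearray()
    while True:
        b = f.read(1)
        if not b:
            raise EOFError("unterminated string")
        if b == b"\x00":
            return out.decode("utf-8")
        out += b

def read_mime_types(f) -> list:
    """The mime-type list is terminated by an empty string."""
    types = []
    while True:
        s = read_string(f)
        if not s:
            return types
        types.append(s)
```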

That’s enough to parse the file header, the list of mime-types, and the lists of pointers to urls, titles, and clusters used for indexing into the ZIM file.

TUPLE: zim path header mime-types urls titles clusters ;

: read-zim ( path -- zim )
    dup binary [
        zim-header read-struct dup {
            [ magic-number>> 0x44D495A assert= ]
            [
                mime-list-ptr-pos>> seek-absolute seek-input
                read-mime-types
            ] [
                dup url-ptr-pos>> seek-absolute seek-input
                entry-count>> [ 8 read le> ] replicate
            ] [
                dup title-ptr-pos>> seek-absolute seek-input
                entry-count>> [ 4 read le> ] replicate
            ] [
                dup cluster-ptr-pos>> seek-absolute seek-input
                cluster-count>> [ 8 read le> ] replicate
            ]
        } cleave zim boa
    ] with-file-reader ;
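The pointer lists differ only in their width: url and cluster pointers are 8-byte integers, title pointers are 4-byte integers. A generic Python helper (hypothetical, not part of any library) captures the pattern of seeking to a position and reading a run of little-endian integers:

```python
def read_pointer_list(f, pos, count, width):
    """Seek to pos and read `count` little-endian unsigned integers,
    each `width` bytes wide (8 for url/cluster pointers, 4 for titles)."""
    f.seek(pos)
    return [int.from_bytes(f.read(width), "little") for _ in range(count)]
```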

Entries

There are two types of directory entries:

  1. content entries
TUPLE: content-entry mime-type parameter-len namespace
    revision cluster-number blob-number url title parameter ;

: read-content-entry ( mime-type -- content-entry )
    read1
    read1
    4 read le>
    4 read le>
    4 read le>
    read-string
    read-string
    f
    content-entry boa
    dup parameter-len>> read >>parameter ;
  2. redirect entries
TUPLE: redirect-entry mime-type parameter-len namespace revision
    redirect-index url title parameter ;

: read-redirect-entry ( mime-type -- redirect-entry )
    read1
    read1
    4 read le>
    4 read le>
    read-string
    read-string
    f
    redirect-entry boa
    dup parameter-len>> read >>parameter ;

The mime-type indicates which type of entry we are reading:

: read-entry ( -- entry )
    2 read le> dup 0xffff =
    [ read-redirect-entry ] [ read-content-entry ] if ;
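A Python sketch of the same dispatch (helper names are my own): both entry types share a common prefix, and the mime-type field of 0xFFFF selects the redirect layout.

```python
def _le(f, n):
    """Read an n-byte little-endian unsigned integer."""
    return int.from_bytes(f.read(n), "little")

def _cstr(f):
    """Read a NUL-terminated UTF-8 string."""
    out = bytearray()
    while (b := f.read(1)) != b"\x00":
        out += b
    return out.decode("utf-8")

def read_entry(f) -> dict:
    """Parse one directory entry, dispatching on the mime-type field."""
    mime_type = _le(f, 2)
    entry = {"mime_type": mime_type,
             "parameter_len": _le(f, 1),
             "namespace": chr(_le(f, 1)),
             "revision": _le(f, 4)}
    if mime_type == 0xFFFF:  # redirect entry
        entry["redirect_index"] = _le(f, 4)
    else:                    # content entry
        entry["cluster_number"] = _le(f, 4)
        entry["blob_number"] = _le(f, 4)
    entry["url"] = _cstr(f)
    entry["title"] = _cstr(f)
    return entry
```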

Now we can read the entry at index n in a ZIM file:

: read-entry-index ( n zim -- entry/f )
    urls>> nth seek-absolute seek-input read-entry ;

Clusters

Content is stored in clusters of data, where each cluster is a sequence of binary blobs located at offsets within the cluster. Each cluster is stored either uncompressed or compressed (typically with LZMA or Zstandard).

We can read the “no compression” version:

: read-cluster-none ( -- offsets blobs )
    4 read le>
    [ 4 /i 1 - [ 4 read le> ] replicate ] [ prefix ] bi
    dup [ last ] [ first ] bi - read ;

And then read the “ZStandard compression” version:

: read-cluster-zstd ( -- offsets blobs )
    zstd-uncompress-stream-frame dup uint32_t deref
    [ 4 /i uint32_t <c-direct-array> ] [ tail-slice ] 2bi
    2dup [ [ last ] [ first ] bi - ] [ length assert= ] bi* ;

The cluster can then be read by checking the compression type in use:

: read-cluster ( -- offsets blobs )
    read1 [ 5 bit? f assert= ] [ 4 bits ] bi {
        { 1 [ read-cluster-none ] }
        { 2 [ "zlib not supported" throw ] }
        { 3 [ "bzip2 not supported" throw ] }
        { 4 [ "lzma not supported" throw ] }
        { 5 [ read-cluster-zstd ] }
    } case ;

To read the blob at index n, we read the entire cluster, then offset into the blobs data:

:: read-cluster-blob ( n -- blob )
    read-cluster :> ( offsets blobs )
    0 offsets nth :> zero
    n offsets nth :> from
    n 1 + offsets nth :> to
    from to [ zero - ] bi@ blobs subseq ;

Now we can read the blob by index into a given cluster in a ZIM file:

: read-blob-index ( blob-number cluster-number zim -- blob )
    clusters>> nth seek-absolute seek-input read-cluster-blob ;

And we can read the entry content from each entry type or index:

GENERIC#: read-entry-content 1 ( entry zim -- blob mime-type )

M:: content-entry read-entry-content ( entry zim -- blob mime-type )
    entry blob-number>>
    entry cluster-number>>
    zim read-blob-index
    entry mime-type>>
    zim mime-types>> nth ;

M: redirect-entry read-entry-content
    [ redirect-index>> ] [ read-entry-content ] bi* ;

M: integer read-entry-content
    [ read-entry-index ] keep '[ _ read-entry-content ] [ f f ] if* ;

Reading the “main page” content is simple using the index stored in the ZIM header:

: read-main-page ( zim -- blob/f mime-type/f )
    [ header>> main-page>> ] [ read-entry-content ] bi ;

We can find an entry by searching for a namespace and url, taking advantage of the fact that the entries are sorted by <namespace><url> to perform a binary search. Some common namespaces include:

  • A - Article
  • C - User Content
  • M - ZIM metadata
  • W - Well known entries
  • X - Search indexes
:: find-entry-url ( namespace url zim -- entry/f )
    f zim header>> entry-count>> <iota> [
        nip zim read-entry-index
        namespace over namespace>> <=>
        dup +eq+ = [ drop url over url>> <=> ] when
    ] search 2drop dup {
        [ ] [ namespace>> namespace = ] [ url>> url = ]
    } 1&& [ drop f ] unless ;
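Since the entries are sorted by (namespace, url), this is a standard binary search. A Python sketch, assuming a hypothetical `read_entry_index(i)` callback that seeks to and parses entry i:

```python
def find_entry_url(namespace, url, entry_count, read_entry_index):
    """Binary search over directory entries sorted by (namespace, url).
    read_entry_index(i) is assumed to return a dict for entry i."""
    key = (namespace, url)
    lo, hi = 0, entry_count
    while lo < hi:
        mid = (lo + hi) // 2
        entry = read_entry_index(mid)
        if (entry["namespace"], entry["url"]) < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < entry_count:
        entry = read_entry_index(lo)
        if (entry["namespace"], entry["url"]) == key:
            return entry
    return None  # not found
```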

If we find the entry after searching, we can read its content:

: read-entry-url ( namespace url zim -- blob/f mime-type/f )
    [ find-entry-url ] keep '[ _ read-entry-content ] [ f f ] if* ;

Web Server

This is all kinda awesome, but basically these ZIM files hold HTML data for an offline instance of the various wiki-type sites. So, wouldn’t it be awesome to make an HTTP responder that loads a ZIM file and then serves data from it on a local Factor HTTP server?

Yes!

TUPLE: zim-responder zim ;

: <zim-responder> ( path -- zim-responder )
    read-zim zim-responder boa ;

M: zim-responder call-responder*
    [
        dup { [ length 1 > ] [ first length 1 = ] } 1&&
        [ unclip-slice first ] [ CHAR: A ] if swap "/" join
    ] dip [
        zim>> dup path>> binary [
            over empty? [ read-entry-url ] [ 2nip read-main-page ] if
        ] with-file-reader
    ] bi* 2dup and [
        <content> binary >>content-encoding
    ] [
        2drop <404>
    ] if ;
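The routing logic in the responder deserves a note: a single-character first path segment selects the namespace, otherwise the path defaults to the "A" (article) namespace. A simplified Python sketch of that routing (names are my own; the Factor responder also handles the main-page case, which is omitted here):

```python
from http.server import BaseHTTPRequestHandler

def split_namespace(path: str):
    """Map a request path to (namespace, url): a single-character first
    segment selects the namespace, otherwise default to "A" (articles)."""
    parts = path.lstrip("/").split("/", 1)
    if len(parts) == 2 and len(parts[0]) == 1:
        return parts[0], parts[1]
    return "A", parts[0]

def make_handler(read_entry_url):
    """read_entry_url(namespace, url) -> (blob, mime_type) or (None, None),
    assumed to wrap the ZIM lookup logic sketched earlier."""
    class ZimHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            blob, mime = read_entry_url(*split_namespace(self.path))
            if blob is None:
                self.send_error(404)
            else:
                self.send_response(200)
                self.send_header("Content-Type", mime)
                self.end_headers()
                self.wfile.write(blob)
    return ZimHandler
```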

We use that to make a little entry point that creates a zim-responder and then sets it as the main-responder and calls httpd to start a web server. Using the latest development version, we can run it like so:

$ ./factor -run=zim.server /path/to/wiki.zim [port]

There are a few features that would be nice to add – like searching URLs, titles, and content, or dealing with split ZIM files (used when files exceed 4GB on file systems like FAT32) – but this is a pretty neat new tool, available now in a nightly build and coming soon in Factor 0.99.