Rosetta Code Downloaded

Wednesday, July 17, 2024

I have been quite distracted by Rosetta Code shenanigans in the past few days. For anyone who is curious about the backstory, I removed the rosetta-code vocabulary from the main Factor git repository, and then decided to archive all the Factor solutions to a separate factor-rosetta-code repository for posterity, utility, and analysis.

I thought it would be fun to talk about different ways to do web scraping for the Rosetta Code website, ultimately choosing to write a Factor vocabulary to do it, which I’ll go into below.

Public Datasets

There are a few public datasets and scraper tools that you can use:

Hugging Face has a @christopher/rosetta-code dataset, but it looks like it was updated “about 2 years ago”, so it may not contain recently contributed solutions.
The @acmeism/RosettaCodeData repository seems to be quite up-to-date and uses a RosettaCode CPAN module and the MediaWiki API to synchronize their data files periodically. At the moment, this is over 600 MB to clone.
The @brollb/rosetta-code-scraper repository is a Rust scraper that uses the reqwest and scraper crates to parse the web pages and extract task descriptions and solutions. It required some minor tweaks to get working with a recent Rust version, and it had some issues with the newer HTML being generated.

Using Factor

I started with the previous approaches, but then I realized that I kinda wanted to build my own program that only grabbed the Factor solutions and wove them into solution files with the task description for the factor-rosetta-code repository.

After considering using the rendered HTML from the Rosetta Code website, I decided it would be a lot simpler to use the MediaWiki API and our mediawiki.api vocabulary to extract the tasks. That vocabulary requires an endpoint to be specified, so we define a simple combinator that sets it before running a quotation.

: with-rosetta-code ( quot -- )
    [ "https://rosettacode.org/w/api.php" endpoint ] dip
    with-variable ; inline
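
As a quick sanity check, and assuming the mediawiki.api and namespaces vocabularies are in the listener’s search path, reading the variable back inside the quotation should show the endpoint that the combinator installed:

IN: scratchpad [ endpoint get ] with-rosetta-code .
"https://rosettacode.org/w/api.php"

Outside the quotation the variable goes back to whatever it was before, which is the usual with-variable scoping.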

The Rosetta Code solutions consist of a list of pages as well as sub-categories with their own lists of pages. We need a list-category word that gets the members of a given category, memoized in case pages reference each other and to speed up subsequent calls through caching:

MEMO: list-category ( title -- members )
    'H{ { "list" "categorymembers" } { "cmtitle" _ } }
    query [ "title" of ] map ;

And a list-categories word that will recursively resolve categories containing other categories:

: list-categories ( title -- tasks )
    list-category [ "Category:" head? ] partition swap
    [ list-categories ] map concat append harvest members sort ;
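
The split between sub-categories and ordinary pages relies on partition, which outputs the elements that satisfy the predicate first. Here is a small offline sketch of that step, using made-up titles rather than real API output:

IN: scratchpad { "Category:Foo" "100_doors" "Category:Bar" "FizzBuzz" }
    [ "Category:" head? ] partition [ . ] bi@
{ "Category:Foo" "Category:Bar" }
{ "100_doors" "FizzBuzz" }

The swap that follows in list-categories moves the plain pages underneath, so the recursively resolved sub-category members get appended after them.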

Using these, we can retrieve all tasks and draft tasks:

: all-tasks ( -- tasks )
    "Category:Solutions_by_Programming_Task" list-categories ;

: draft-tasks ( -- tasks )
    "Category:Draft_Programming_Tasks" list-categories ;

Each task page is a series of sections, beginning with the task description, and then a series of solutions in different programming languages. Using page-content, we can see what one of these pages looks like:

IN: scratchpad [ "Sieve_of_Eratosthenes" page-content ] with-rosetta-code

We can build a word that extracts a section that is specified by a begin text and an end text, searching for them using subseq-index to find where they occur in the page:

:: extract-section ( page begin end -- section/f )
    page begin subseq-index [
        begin length +
        dup page end subseq-index-from
        [ page length ] unless*
        page subseq
    ] [ f ] if* ;
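
To make the three cases concrete (both markers present, end marker missing, begin marker missing), here is a listener sketch on toy strings that are not real page content:

IN: scratchpad "a [b] c" "[" "]" extract-section .
"b"
IN: scratchpad "a [b c" "[" "]" extract-section .
"b c"
IN: scratchpad "abc" "[" "]" extract-section .
f

When the end marker is missing, the section runs to the end of the page, and when the begin marker is missing, the result is f.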

The description is everything before the first header section:

: get-description ( page -- description/f )
    "=={{header" over subseq? [
        "" "=={{header" extract-section
    ] [ drop f ] if ;
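
A small sketch of how this behaves, using hypothetical page text instead of a real download:

IN: scratchpad "Find the prime.\n=={{header|Factor}}==\n(solutions follow)" get-description .
"Find the prime.\n"
IN: scratchpad "No solutions yet." get-description .
f

A page without any header section yields f rather than treating the whole page as a description.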

The solution code is the first <syntaxhighlight> block for our desired language:

: get-code ( page lang -- code/f )
    "<syntaxhighlight lang=\"" "\">" surround
    "</syntaxhighlight>" extract-section ;
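
The opening tag is built with surround, which wraps the language name between the two literal strings. A short sketch on a hand-written snippet (not a real page) shows the tag and the extracted code:

IN: scratchpad "factor" "<syntaxhighlight lang=\"" "\">" surround .
"<syntaxhighlight lang=\"factor\">"
IN: scratchpad "<syntaxhighlight lang=\"factor\">1 2 + .</syntaxhighlight>" "factor" get-code .
"1 2 + ."

Since this is an exact string match, a tag written with different attributes or spacing would presumably not be found, and the word would return f.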

We can use those words to weave the commented-out description with the Factor source code:

: get-solution ( task -- solution/f )
    page-content [ get-description ] keep over empty?
    [ 2drop f ] [
        [ string-lines [ "! " prepend ] map "\n" join ]
        [ "factor" get-code "\n\n" glue "\n" append ] bi*
    ] if ;
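
The two halves of that weave can be tried on their own. This sketch uses a made-up two-line description and a trivial code string to show the comment prefixing and the blank-line separator that glue inserts:

IN: scratchpad "Task:\nFind it." string-lines [ "! " prepend ] map "\n" join print
! Task:
! Find it.
IN: scratchpad "! Task:" "1 2 + ." "\n\n" glue print
! Task:

1 2 + .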

That works great. You can try it by printing out one of the draft tasks:

IN: scratchpad [ "10001th_prime" get-solution print ] with-rosetta-code
! Task:
!
! Find and show on this page the 10001st prime number.

USING: math math.primes prettyprint ;

2 10,000 [ next-prime ] times .

Now we want a way to save a task, and since the tasks have names that aren’t all valid in filenames or vocabulary names, we do a little cleanup to turn a task name into a path:

: task-path ( task -- path )
    [ dup { [ Letter? ] [ digit? ] } 1|| [ drop CHAR: - ] unless ] map
    >lower R/ --+/ "-" re-replace [ CHAR: - = ] trim ".factor" append ;
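
A few illustrative inputs show the cleanup at work: non-alphanumeric characters become dashes, runs of dashes collapse to one, and the result is lowercased:

IN: scratchpad "10001th_prime" task-path .
"10001th-prime.factor"
IN: scratchpad "Abbreviations, easy" task-path .
"abbreviations-easy.factor"
IN: scratchpad "A+B" task-path .
"a-b.factor"

Two different task names could in principle map to the same path under this scheme, so it is worth keeping in mind if a file ever seems to be overwritten.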

Saving a task means getting the solution and then writing it to a file:

: save-task ( task -- )
    "vocab:rosetta-code/solutions" [
        [ get-solution ]
        [ task-path '[ _ utf8 set-file-contents ] when* ] bi
    ] with-directory ;

With that, we can finally save all the tasks, or all the draft tasks:

: save-all-tasks ( -- )
    all-tasks [ save-task ] each ;

: save-draft-tasks ( -- )
    draft-tasks [ save-task ] each ;

I used this with some minor changes: ignoring certain categories that do not contain solutions, and using Pandoc to convert the MediaWiki markup before embedding it in the solution files.

Anyway, pretty cool!