Rosetta Code Downloaded
Wednesday, July 17, 2024
I have been quite distracted by Rosetta Code shenanigans in the past few days. For anyone that is curious about the backstory, I removed the rosetta-code vocabulary from the main Factor git repository, and then decided to archive all the Factor solutions to a separate factor-rosetta-code repository for posterity, utility, and analysis.
I thought it would be fun to talk about different ways to do web scraping for the Rosetta Code website, ultimately choosing to write a Factor vocabulary to do it, which I’ll go into below.
Public Datasets
There are a couple of public datasets and scraper tools that you can use:
- Hugging Face has a @christopher/rosetta-code dataset, but it looks like it was updated “about 2 years ago”, so it perhaps doesn’t contain recently contributed solutions.
- The @acmeism/RosettaCodeData repository seems to be quite up-to-date and uses a RosettaCode CPAN module and the MediaWiki API to synchronize their data files periodically. At the moment, this is over 600 MB to clone.
- The @brollb/rosetta-code-scraper repository is a Rust scraper that uses the reqwest and scraper crates to parse the web pages and extract task descriptions and solutions. This required some minor tweaks to get working with a recent Rust version, and had some issues with the newer HTML being generated.
Using Factor
I started with the previous approaches, but then I realized that I kinda wanted to build my own program that only grabbed the Factor solutions and weaved them into solution files with the task description for the factor-rosetta-code repository.
After considering using the rendered HTML from the Rosetta Code website, I decided it would be a lot simpler to use the MediaWiki API and our mediawiki.api vocabulary to extract the tasks. That vocabulary requires an endpoint to be specified, so we define a simple combinator that sets it before running a quotation.
: with-rosetta-code ( quot -- )
    [ "https://rosettacode.org/w/api.php" endpoint ] dip
    with-variable ; inline
The Rosetta Code solutions consist of a list of pages as well as sub-categories with their own lists of pages. We need a list-category word that will get the members of a given category, memoized in case pages reference each other or to speed up subsequent calls through caching:
MEMO: list-category ( title -- members )
    'H{ { "list" "categorymembers" } { "cmtitle" _ } }
    query [ "title" of ] map ;
And a list-categories word that will recursively resolve categories containing other categories:
: list-categories ( title -- tasks )
    list-category [ "Category:" head? ] partition swap
    [ list-categories ] map concat append harvest members sort ;
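To see what the two halves of that definition do without calling the API, here is a small listener sketch. The titles are illustrative values typed in by hand, not fetched results:

```factor
! Split a member list into sub-categories and plain pages.
IN: scratchpad { "Category:Sorting Algorithms" "100 doors" "A+B" } [ "Category:" head? ] partition [ . ] bi@
{ "Category:Sorting Algorithms" }
{ "100 doors" "A+B" }
! Drop empty titles, remove duplicates, and sort the result.
IN: scratchpad { "A+B" "" "100 doors" "A+B" } harvest members sort .
{ "100 doors" "A+B" }
```

The recursion only happens on the first group, and the pages it returns are appended to the second group before the cleanup step.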
Using these, we can retrieve all tasks and draft tasks:
: all-tasks ( -- tasks )
    "Category:Solutions_by_Programming_Task" list-categories ;

: draft-tasks ( -- tasks )
    "Category:Draft_Programming_Tasks" list-categories ;
Each task page is a series of sections, beginning with the task description, and then a series of solutions in different programming languages. Using page-content, we can see what one of these pages looks like:
IN: scratchpad [ "Sieve_of_Eratosthenes" page-content ] with-rosetta-code
We can build a word that extracts a section that is specified by a begin text and an end text, searching for them using subseq-index to find where they occur in the page:
:: extract-section ( page begin end -- section/f )
    page begin subseq-index [
        begin length +
        dup page end subseq-index-from
        [ page length ] unless*
        page subseq
    ] [ f ] if* ;
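A quick listener sketch of the three cases this word handles, assuming extract-section is loaded as defined above. The input strings are made up for illustration:

```factor
! Both markers present: the text between them.
IN: scratchpad "abc[def]ghi" "[" "]" extract-section .
"def"
! End marker missing: everything after the begin marker.
IN: scratchpad "abc[def" "[" "]" extract-section .
"def"
! Begin marker missing: no section at all.
IN: scratchpad "abcdef" "[" "]" extract-section .
f
```

Falling back to the end of the page when the end marker is missing means a truncated page still yields something, while a missing begin marker is reported as f.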
The description is everything before the first header section:
: get-description ( page -- description/f )
    "=={{header" over subseq? [
        "" "=={{header" extract-section
    ] [ drop f ] if ;
The solution code is the first <syntaxhighlight> block for our desired language:
: get-code ( page lang -- code/f )
    "<syntaxhighlight lang=\"" "\">" surround
    "</syntaxhighlight>" extract-section ;
We can use those words to weave the commented-out description with the Factor source code:
: get-solution ( task -- solution/f )
    page-content [ get-description ] keep over empty?
    [ 2drop f ] [
        [ string-lines [ "! " prepend ] map "\n" join ]
        [ "factor" get-code "\n\n" glue "\n" append ] bi*
    ] if ;
That works great. You can try it by printing out one of the draft tasks:
IN: scratchpad [ "10001th_prime" get-solution print ] with-rosetta-code
! Task:
!
! Find and show on this page the 10001st prime number.
USING: math math.primes prettyprint ;
2 10,000 [ next-prime ] times .
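The parsing words can also be exercised without touching the network. The page below is a made-up, heavily shortened stand-in for real wiki markup, and sample-page is a hypothetical helper that is not part of the vocabulary. This is only a sketch of how get-description and get-code behave, assuming both are loaded as defined above:

```factor
! sample-page is a hypothetical stand-in, not real Rosetta Code content.
IN: scratchpad : sample-page ( -- page )
    "Add two numbers.\n=={{header|Factor}}==\n<syntaxhighlight lang=\"factor\">1 2 + .</syntaxhighlight>\n" ;
! Everything before the first header is the description.
IN: scratchpad sample-page get-description .
"Add two numbers.\n"
! The first block tagged with the requested language is the code.
IN: scratchpad sample-page "factor" get-code .
"1 2 + ."
! A language with no block on the page yields f.
IN: scratchpad sample-page "python" get-code .
f
```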
Now we want a way to save a task, and since the tasks have names that aren’t all valid in filenames or vocabulary names, we do a little cleanup to turn a task name into a path:
: task-path ( task -- path )
    [ dup { [ Letter? ] [ digit? ] } 1|| [ drop CHAR: - ] unless ] map
    >lower R/ --+/ "-" re-replace [ CHAR: - = ] trim ".factor" append ;
Saving a task is getting the solution and then saving to a file:
: save-task ( task -- )
    "vocab:rosetta-code/solutions" [
        [ get-solution ]
        [ task-path '[ _ utf8 set-file-contents ] when* ] bi
    ] with-directory ;
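Before writing anything to disk, it may be worth checking which file names task-path produces. A listener sketch, assuming task-path is loaded as defined above; the last title is a made-up name chosen to show the trimming step:

```factor
! Underscores and other punctuation become dashes.
IN: scratchpad "10001th_prime" task-path .
"10001th-prime.factor"
! Runs of dashes collapse into one.
IN: scratchpad "Abbreviations, easy" task-path .
"abbreviations-easy.factor"
IN: scratchpad "Hello world/Text" task-path .
"hello-world-text.factor"
! Hypothetical title: leading and trailing dashes are trimmed.
IN: scratchpad "(Draft) Example!" task-path .
"draft-example.factor"
```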
With that, we can finally save all the tasks, or all the draft tasks:
: save-all-tasks ( -- )
    all-tasks [ save-task ] each ;

: save-draft-tasks ( -- )
    draft-tasks [ save-task ] each ;
I used this with some minor changes: ignoring certain categories that do not contain solutions, and using Pandoc to convert the MediaWiki markup before embedding it in the solution files.
Anyway, pretty cool!