Python Kirby scraper

Hi!

Finally I can share something that might be useful!

I had to convert an old website running on The Secretary to Kirby, and instead of cleaning up the CSV file I exported from the MySQL database, I put together a small Python script to automate that.

The script visits a bunch of sub-pages linked from an ‘index’ page of the website and converts them into Kirby subfolders, each with a txt file and images.

It works like this:

python scraper.py <url w/ links to visit> <subfolder-name> <page-name>
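
Roughly speaking, those three arguments map onto something like this (a sketch; the variable names are illustrative, not taken from the actual script):

```python
# Sketch of the command-line interface; names are illustrative only.
import sys

if len(sys.argv) != 4:
    sys.exit("usage: python scraper.py <url-with-links> <subfolder-name> <page-name>")

index_url, subfolder_name, page_name = sys.argv[1:4]
```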

You have to adapt the parsing operations to suit your use case, but in general:

  • start from a page containing a list of subpages (or, better, a sitemap)
  • grab what you need from each subpage and add it to the article {} dictionary
  • let the script create a subfolder for each subpage visited, with a Kirby-formatted page.txt file and the pictures (see the sketch after this list)
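
To make those steps concrete, here is a minimal sketch of the idea, not the actual script: it assumes every `<a>` link on the index page points to a subpage, and the parsing, the slug and the field names (`Title`, `Text`) are placeholders you would adapt.

```python
# Minimal sketch: index page -> Kirby subfolders with a txt file and images.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def scrape(index_url, subfolder_name, page_name):
    # 1. collect the subpage links from the index page
    index_soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    links = [urljoin(index_url, a["href"]) for a in index_soup.select("a[href]")]

    for url in links:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        # 2. grab what you need from the subpage into a dictionary
        article = {
            "Title": soup.title.get_text(strip=True) if soup.title else url,
            "Text": soup.get_text("\n", strip=True),  # placeholder parsing
        }

        # 3. one subfolder per subpage, with a Kirby-formatted txt file
        slug = url.rstrip("/").rsplit("/", 1)[-1] or "page"
        folder = os.path.join(subfolder_name, slug)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, page_name + ".txt"), "w") as f:
            f.write("\n\n----\n\n".join(f"{k}: {v}" for k, v in article.items()))

        # ...and download the pictures referenced by the subpage
        for img in soup.select("img[src]"):
            img_url = urljoin(url, img["src"])
            with open(os.path.join(folder, img_url.rsplit("/", 1)[-1]), "wb") as out:
                out.write(requests.get(img_url).content)
```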

More info here, and you can find the script here. The code is also commented.

The code could probably be way more Pythonic, but I’m pretty satisfied with it for two half-afternoons of work.


Nice one… just a thought though: it might be simpler to get it to accept a text file containing each URL on its own line. That way you can grab the URL list with wget and either write it to a text file or just pipe it straight through to your Python script.

It saves having to make up a page/template in whatever CMS to list out all the pages. I’m not much of a terminal ninja, but after a quick little test, this seems to do it. The list of file extensions is what you want to exclude:

wget -m https://example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(eot?\|woff\|json\|ico\|pdf\|eot\|txt\|ttf\|otf\|svg\|css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt
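
On the Python side, consuming such a list is only a few lines; a sketch with a hypothetical read_urls helper, reading either a file or piped stdin:

```python
# Hypothetical helper: read one URL per line from a file object,
# skipping blank lines, so the list can come from urls.txt or stdin.
import sys

def read_urls(source):
    return [line.strip() for line in source if line.strip()]

# urls = read_urls(open("urls.txt"))   # from the wget output file
# urls = read_urls(sys.stdin)          # or piped straight through
```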

Yep, that’s a good point!

The original use case for this code was to grab URLs from a sitemap.xml page, so the starting point can definitely vary depending on the use case.
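
For what it’s worth, pulling the URLs out of a sitemap.xml can look roughly like this (a sketch assuming the standard sitemap namespace):

```python
# Sketch: extract <loc> entries from a standard sitemap.xml.
import requests
import xml.etree.ElementTree as ET

def sitemap_urls(sitemap_url):
    root = ET.fromstring(requests.get(sitemap_url).content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns)]
```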

🙂