Python Kirby scraper

Hi!

Finally I can share something that might be useful!

I had to convert an old website running on The Secretary to Kirby, and instead of cleaning up the CSV file I exported from the MySQL database, I put together a small Python script to automate that.

The script visits a bunch of sub-pages linked from an ‘index’ page of the website and converts them into Kirby subfolders, each with a txt file and images.

It works like this:

python scraper.py <url w/ links to visit> <subfolder-name> <page-name>
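
Roughly speaking, those three arguments map onto something like this (a sketch; the variable names are illustrative, not taken from the actual script):

```python
# Sketch of the command-line interface; names are illustrative only.
import sys

if len(sys.argv) != 4:
    sys.exit("usage: python scraper.py <url-with-links> <subfolder-name> <page-name>")

index_url, subfolder_name, page_name = sys.argv[1:4]
```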

You have to adapt the parsing operations to suit your use case, but in general:

  • start from a page containing a list of subpages (or, better, a sitemap)
  • grab what you need from each subpage and add it to the article {} dictionary
  • let the script create a subfolder for each subpage visited, with a Kirby-formatted page.txt file and the pictures (see the sketch after this list)
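
To make those steps concrete, here is a minimal sketch of the idea, not the actual script: it assumes every `<a>` link on the index page points to a subpage, and the parsing, the slug and the field names (`Title`, `Text`) are placeholders you would adapt.

```python
# Minimal sketch: index page -> Kirby subfolders with a txt file and images.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def scrape(index_url, subfolder_name, page_name):
    # 1. collect the subpage links from the index page
    index_soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    links = [urljoin(index_url, a["href"]) for a in index_soup.select("a[href]")]

    for url in links:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        # 2. grab what you need from the subpage into a dictionary
        article = {
            "Title": soup.title.get_text(strip=True) if soup.title else url,
            "Text": soup.get_text("\n", strip=True),  # placeholder parsing
        }

        # 3. one subfolder per subpage, with a Kirby-formatted txt file
        slug = url.rstrip("/").rsplit("/", 1)[-1] or "page"
        folder = os.path.join(subfolder_name, slug)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, page_name + ".txt"), "w") as f:
            f.write("\n\n----\n\n".join(f"{k}: {v}" for k, v in article.items()))

        # ...and download the pictures referenced by the subpage
        for img in soup.select("img[src]"):
            img_url = urljoin(url, img["src"])
            with open(os.path.join(folder, img_url.rsplit("/", 1)[-1]), "wb") as out:
                out.write(requests.get(img_url).content)
```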

More info here, and you can find the script here. The code is also commented.

The code could probably be way more Pythonic, but I’m pretty satisfied with it for two half-afternoons of work.


Nice one… just a thought though: it might be simpler to get it to accept a text file containing each URL on its own line. That way you can grab the URL list with wget and either write it to a text file or just pipe it straight through to your Python script.

It saves having to make up a page/template in whatever CMS to list out all the pages. I’m not much of a terminal ninja, but after a quick little test, this seems to do it. The list of file extensions is what you want to exclude:

wget -m https://example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(eot?\|woff\|json\|ico\|pdf\|eot\|txt\|ttf\|otf\|svg\|css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt
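
On the Python side, consuming such a list is only a few lines; a sketch with a hypothetical read_urls helper, reading either a file or piped stdin:

```python
# Hypothetical helper: read one URL per line from a file object,
# skipping blank lines, so the list can come from urls.txt or stdin.
import sys

def read_urls(source):
    return [line.strip() for line in source if line.strip()]

# urls = read_urls(open("urls.txt"))   # from the wget output file
# urls = read_urls(sys.stdin)          # or piped straight through
```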

Yep, that’s a good point!

The original use case for this code was to grab URLs from a sitemap.xml page, so the starting point can definitely vary depending on the use case.
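
For what it’s worth, pulling the URLs out of a sitemap.xml can look roughly like this (a sketch assuming the standard sitemap namespace):

```python
# Sketch: extract <loc> entries from a standard sitemap.xml.
import requests
import xml.etree.ElementTree as ET

def sitemap_urls(sitemap_url):
    root = ET.fromstring(requests.get(sitemap_url).content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns)]
```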

🙂