I teach web development in a secure setting (prison). This requires me to capture many, many web archives of single web pages, covering a wide variety of topics, for my incarcerated students to view.
After 4 years of simply using the filesystem and nested subfolders, it’s become an administrative nightmare.
I would like to start using a CMS like Kirby with a simple workflow to intelligently tag all of these websites topically and use a search page to locate and display the items.
I’m throwing it out there for people to suggest an approach and maybe point me in the right direction of where to get started. I have oodles of experience, but not very much with Kirby.
I assume that for each website you have a folder containing a single HTML file and a folder for the assets (JS, CSS, images) that were linked on that page, just like when you use Ctrl+S in a browser.
Here is what I would do:
Put all websites in a folder that Kirby can read easily.
Create a “websites” template under which the single websites are stored. Then create a “website” template (this will hold your tags and other metadata about the saved page) whose pages are subpages of “websites” (just like folders). Think of a blog and its posts.
Create a template and page called “import” that uses Kirby’s Dir class to iterate over each website folder. Since importers tend to be run multiple times and might run longer than 30 seconds before working perfectly, you might need some kind of offset/limit logic, or run Kirby from the CLI instead of through a template (like with my Janitor plugin).
In that “import” template, get page('websites') and call createChild() to create the subpages. The Dir class will help you get filenames, filter for HTML files, etc. On each newly created page, use PHP’s ZipArchive class to zip the original website folder and attach that zip to the Kirby page, so you end up with websites/some-page/archive.zip. I would also grab all the text in the HTML, strip the tags, and put it in a textarea field of the “website” page (so the content can be searched). Maybe crawl the DOM and extract some more metadata as well (Open Graph tags might be a good start).
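The import step above could be sketched roughly like this. This is only a sketch, assuming Kirby 3: the `/archive` source folder, the `index.html`-style filenames, and the `tags`/`content` field names are all placeholders you would adapt to your setup.

```php
<?php
// site/templates/import.php — sketch only, assumes Kirby 3.
// The saved sites are assumed to live in /archive at the project root.
kirby()->impersonate('kirby'); // createChild() needs permissions

$root     = kirby()->root('base') . '/archive';
$websites = page('websites');

foreach (Dir::dirs($root) as $dirname) {
    $source = $root . '/' . $dirname;

    // find the saved HTML file inside the folder
    $html = null;
    foreach (Dir::files($source) as $file) {
        if (F::extension($file) === 'html') {
            $html = $source . '/' . $file;
            break;
        }
    }
    if ($html === null) {
        continue; // nothing to import here
    }

    // plain-text version of the page, so Kirby's search can find it
    $text = trim(strip_tags(F::read($html)));

    $page = $websites->createChild([
        'slug'     => Str::slug($dirname),
        'template' => 'website',
        'content'  => [
            'title'   => $dirname,
            'tags'    => '',    // tag later, e.g. via the Panel
            'content' => $text,
        ],
    ]);

    // zip the original folder and store it alongside the new page
    $zip = new ZipArchive();
    $zip->open($page->root() . '/archive.zip', ZipArchive::CREATE);
    foreach (Dir::index($source, true) as $item) {
        if (is_file($source . '/' . $item)) {
            $zip->addFile($source . '/' . $item, $item);
        }
    }
    $zip->close();
}
```

For thousands of sites you would wrap the loop in the offset/limit logic mentioned above, or run it from the CLI, so a single request never has to process everything.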
Instead of rendering its own content, the “website” template will extract the zip to a folder in Kirby’s public folder (that’s where the root index.php is), for example /media/websites/some-page/*. It will then read the HTML from the extracted archive and maybe fix a few paths with simple string replacement so they point to the right folder.
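A minimal version of that “website” template might look like this. Again a sketch, assuming Kirby 3 and the archive.zip from the import step; the `index.html` filename and the `src="./` replacement pattern are assumptions and would need tuning against your real archives.

```php
<?php
// site/templates/website.php — sketch only, assumes Kirby 3.
$target = kirby()->root('index') . '/media/websites/' . $page->slug();

// extract once, reuse the folder on later requests
if (Dir::exists($target) === false) {
    $zip = new ZipArchive();
    $zip->open($page->root() . '/archive.zip');
    $zip->extractTo($target);
    $zip->close();
}

// read the stored HTML and point relative asset paths at the media folder
$html = F::read($target . '/index.html'); // filename is an assumption

echo str_replace(
    'src="./',
    'src="/media/websites/' . $page->slug() . '/',
    $html
);
```

A naive string replace like this works for the common case; pages with unusual asset paths may need a proper DOM rewrite instead.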
Why the zip? It’s smaller to store, and you can freely clean the media folder to get rid of lots of temporarily extracted files.
Wow, thanks. I’m going to take a minute to digest all of this…
The steps seem straightforward… Because I’m going to amass thousands of sites this way, I’d ideally like to get my workflow down to dropping a bunch of html+resourceFolder pairs into a directory and finding a way to tag the items expediently.
I guess the prison IT department doesn’t want to get a proxy server with a whitelist?
Seriously:
I’d say that as long as your “topic tagging” logic can be represented hierarchically, keep your current folder structure and use Kirby to navigate the folders with a “nicer than Apache index” UI. You could put your static archive into a folder in your webroot and generate virtual pages in Kirby by scanning that folder structure. Each virtual page could then simply link to the static URL.
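In Kirby 3, generating virtual pages from a folder scan is typically done by overriding `children()` in a page model and returning `Pages::factory()`. A sketch, assuming the archive lives in an `/archive` folder inside the webroot; the model and template names are placeholders:

```php
<?php
// site/models/archive.php — sketch only, assumes Kirby 3 virtual pages.
class ArchivePage extends Page
{
    public function children()
    {
        $root  = kirby()->root('index') . '/archive';
        $pages = [];

        foreach (Dir::dirs($root) as $dirname) {
            $pages[] = [
                'slug'     => Str::slug($dirname),
                'template' => 'archive-item',
                'content'  => [
                    'title' => $dirname,
                    // link straight to the statically served copy;
                    // the index.html filename is an assumption
                    'link'  => '/archive/' . $dirname . '/index.html',
                ],
            ];
        }

        return Pages::factory($pages, $this);
    }
}
```

Because the pages are virtual, nothing is imported or duplicated; Kirby just presents the existing folder structure, which keeps the “functional in a short time” promise.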
This way you have something functional up in a rather short time.
In a second step you could then add metadata (as txt files) to the topic folders. This allows for an even nicer UI, for example by giving them human-readable names and descriptions, or synonym tags for a search function.
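Such a metadata file could be an ordinary Kirby content text file dropped into the topic folder. A hypothetical example (the filename and field names are just suggestions):

```
Title: JavaScript Basics

----

Description: Saved tutorials covering core JavaScript concepts.

----

Synonyms: js, ecmascript, scripting
```

The virtual pages could then read this file, when present, and fall back to the raw folder name otherwise.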
In a third step you can add metadata to the individual pages. At this point it might become interesting to index that metadata, either in a database or with the help of a plugin like this one.
All of this can be done either by just editing folders and files, or via the Panel, if you override the right methods in the page model you use for the virtual pages.