Last modified by superadmin on 2018-01-12 20:29

HW14 - Wikipedia Backups by Crawling

Goal: Process wiki markup and implement controllable updates of downloaded data. 

Regularly making offline copies of selected Wikipedia data yields a backup that is both up-to-date and controllable. Develop a simple tool in the programming language Groovy to back up some Wikipedia category, e.g. the articles listed in the Latgalian test incubator http://incubator.wikimedia.org/wiki/Category:Latgalian_Wikipedia (or any other Wikipedia category, possibly containing sub-categories, see e.g. http://lv.wikipedia.org/wiki/Kategorija:Informācijas_tehnoloģijas). The software should download the wiki-markup sources and save them to predictable files in subdirectories. It should be designed to help protect low-activity Wikipedia content from vandalism and to support mass-updates of certain data.
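The core of such a backup tool is mapping each page title to (a) the URL that returns its raw wiki-markup and (b) a predictable file path under the backup folder. The assignment targets Groovy, but since Groovy runs on the JVM, a plain-Java sketch translates almost line-for-line. Below is a minimal sketch of those two mappings; the base URL, the `.wiki` extension, and the title-to-filename convention are all illustrative assumptions, not part of the assignment. Fetching the raw markup uses MediaWiki's standard `action=raw` query.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class WikiBackup {
    // Assumed wiki to back up; swap in the incubator or any other MediaWiki base.
    static final String BASE = "https://lv.wikipedia.org/w/index.php";

    // URL returning the raw wiki-markup of a page (MediaWiki's action=raw).
    static String rawUrl(String title) {
        return BASE + "?title="
                + URLEncoder.encode(title, StandardCharsets.UTF_8)
                + "&action=raw";
    }

    // Map a page title to a predictable file path inside the backup root:
    // every run of non-letter/non-digit characters becomes "_" (illustrative
    // convention), so repeated backups overwrite the same file.
    static Path targetPath(Path root, String title) {
        String safe = title.replaceAll("[^\\p{L}\\p{N}]+", "_");
        return root.resolve(safe + ".wiki");
    }

    public static void main(String[] args) {
        System.out.println(rawUrl("Informācijas tehnoloģijas"));
        System.out.println(targetPath(Path.of("backup"), "Category:Latgalian Wikipedia"));
    }
}
```

Downloading is then a plain HTTP GET of `rawUrl(title)` written to `targetPath(root, title)`; to handle sub-categories, the tool would first enumerate members via MediaWiki's `action=query&list=categorymembers` API and recurse into entries in the `Category:` namespace.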

Students are encouraged to localize and configure existing crawler solutions, if any are available in Java.

Created by Kalvis Apsītis on 2007-10-21 16:03
This wiki is licensed under a Creative Commons 2.0 license