HW14 - Wikipedia Backups by Crawling

Last modified by superadmin on 2018-01-12 20:32

Goal: Process wiki markup and implement controllable updates of downloaded data. 

Regularly making offline copies of selected Wikipedia data yields a dataset that is both up to date and under your own control. Develop a simple tool in the Groovy programming language that backs up a given Wikipedia category, possibly including its sub-categories (see e.g. http://lv.wikipedia.org/wiki/Kategorija:Informācijas_tehnoloģijas). The tool should download the wiki-markup sources of the pages into a predictable folder hierarchy of subdirectories. It should be designed to protect low-activity Wikipedia content from vandalism and to support mass updates of certain data.
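The crawl itself can use the standard MediaWiki web API: list=categorymembers enumerates the pages (and sub-categories) of a category, and index.php?action=raw returns a page's wiki-markup source. Below is a minimal sketch in Java (the assignment asks for Groovy, which can call this code unchanged); the class and method names, the output directory layout, and the filename sanitization rule are illustrative assumptions, not part of the assignment.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Minimal sketch of a category backup crawler for the MediaWiki API. */
public class WikiBackup {

    /** Builds the MediaWiki API query URL that lists members of a category. */
    static String categoryMembersUrl(String wikiBase, String categoryTitle) {
        return wikiBase + "/w/api.php?action=query&list=categorymembers"
                + "&cmtitle=" + URLEncoder.encode(categoryTitle, StandardCharsets.UTF_8)
                + "&cmlimit=500&format=json";
    }

    /** URL that returns the raw wiki-markup source of a single page. */
    static String rawPageUrl(String wikiBase, String pageTitle) {
        return wikiBase + "/w/index.php?action=raw&title="
                + URLEncoder.encode(pageTitle, StandardCharsets.UTF_8);
    }

    /** Maps a page title to a predictable, filesystem-safe file name. */
    static String titleToFileName(String title) {
        return title.replaceAll("[\\\\/:*?\"<>|]", "_") + ".wiki";
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 2) {
            // e.g. WikiBackup https://lv.wikipedia.org "Kategorija:Informācijas tehnoloģijas"
            System.err.println("usage: WikiBackup <wikiBase> <categoryTitle> [outDir]");
            return;
        }
        String wikiBase = args[0], category = args[1];
        Path outDir = Paths.get(args.length > 2 ? args[2] : "backup",
                titleToFileName(category));
        Files.createDirectories(outDir);
        HttpClient client = HttpClient.newHttpClient();
        // Fetch the category listing (JSON). A full tool would parse the member
        // titles, recurse into sub-categories, and save each page's raw source
        // (rawPageUrl) under outDir; here we only store the listing itself.
        HttpRequest listReq = HttpRequest.newBuilder(
                URI.create(categoryMembersUrl(wikiBase, category))).build();
        String json = client.send(listReq, HttpResponse.BodyHandlers.ofString()).body();
        Files.writeString(outDir.resolve("_members.json"), json);
    }
}
```

Because the API returns at most one batch of members per request, a complete implementation would also follow the "cmcontinue" continuation token in the JSON response until the category is exhausted.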

Students are encouraged to adapt and configure existing crawler solutions where Java implementations are available.
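The "controllable updates" part of the goal can be sketched as a timestamp comparison: the MediaWiki API reports each page's latest revision timestamp (prop=revisions&rvprop=timestamp), which can be compared against the timestamp recorded at backup time. The class and method names below are illustrative assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

/** Sketch of the change check behind controllable, incremental updates. */
public class UpdateCheck {

    /** API URL returning the latest revision timestamp of a page as JSON. */
    static String lastRevisionUrl(String wikiBase, String pageTitle) {
        return wikiBase + "/w/api.php?action=query&prop=revisions&rvprop=timestamp"
                + "&titles=" + URLEncoder.encode(pageTitle, StandardCharsets.UTF_8)
                + "&format=json";
    }

    /**
     * True if the live revision is strictly newer than the locally saved copy.
     * Timestamps are ISO-8601 strings as returned by the MediaWiki API,
     * e.g. "2018-01-12T20:32:00Z".
     */
    static boolean needsRefresh(String savedTimestamp, String liveTimestamp) {
        return Instant.parse(liveTimestamp).isAfter(Instant.parse(savedTimestamp));
    }

    public static void main(String[] args) {
        // An update run would compare timestamps for every saved page and
        // re-download only pages that changed; an unexpected change in a
        // low-activity category can be flagged for manual vandalism review.
        System.out.println(needsRefresh("2018-01-01T00:00:00Z", "2018-01-12T20:32:00Z"));
    }
}
```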


Created by Alina Vasiljeva on 2009-11-26 11:23
This wiki is licensed under a Creative Commons 2.0 license