HW14 - Wikipedia Backups by Crawling
Goal: Process wiki markup and implement controllable updates of downloaded data.
Regularly making offline copies of selected Wikipedia data provides a dataset that is both up-to-date and under your own control. Develop a simple program in the Groovy programming language that backs up a Wikipedia category, possibly including its sub-categories (see e.g. http://lv.wikipedia.org/wiki/Kategorija:Informācijas_tehnoloģijas). The program should download the wiki-markup sources of the pages into a predictable folder structure, mirroring sub-categories as subdirectories. It should be designed to help protect low-activity Wikipedia content from vandalism and to support mass updates of the downloaded data. A minimal sketch follows.
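The sketch below shows one possible approach, assuming the standard MediaWiki web API: action=query&list=categorymembers to list a category's members and action=raw to fetch a page's wikitext. The category name, output folder, and file-naming scheme are illustrative choices, and continuation paging for categories with more than 500 members is omitted for brevity.

import groovy.json.JsonSlurper

// Illustrative defaults; adjust to the category you want to back up.
def wiki      = 'https://lv.wikipedia.org'
def category  = 'Kategorija:Informācijas_tehnoloģijas'
def backupDir = new File('backup')

// Small helper that calls the MediaWiki API and parses the JSON response.
def api = { Map params ->
    def query = params.collect { k, v -> "$k=${URLEncoder.encode(v as String, 'UTF-8')}" }.join('&')
    new JsonSlurper().parseText(new URL("$wiki/w/api.php?$query").getText('UTF-8'))
}

// Replace characters that are not safe in file names.
def safeName = { String title -> title.replaceAll(/[:\/\\]/, '_') }

def backupCategory
backupCategory = { String cat, File dir ->
    dir.mkdirs()
    def result = api(action: 'query', list: 'categorymembers',
                     cmtitle: cat, cmlimit: '500', format: 'json')
    result.query.categorymembers.each { member ->
        if (member.ns == 14) {
            // Namespace 14 = category: recurse into the sub-category.
            backupCategory(member.title, new File(dir, safeName(member.title)))
        } else {
            // action=raw returns the page's wiki-markup source as plain text.
            def src = new URL("$wiki/w/index.php?title=${URLEncoder.encode(member.title, 'UTF-8')}&action=raw").getText('UTF-8')
            new File(dir, safeName(member.title) + '.wiki').setText(src, 'UTF-8')
        }
    }
}

backupCategory(category, backupDir)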
Students are encouraged to localize and configure existing crawler solutions, where Java implementations are available.
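The "controllable updates" requirement could, for example, be met by comparing each freshly downloaded source with the stored copy before overwriting it. The fragment below is one possible approach, not part of the assignment specification; the 50% shrinkage threshold for flagging a suspicious edit (e.g. page blanking) is an arbitrary illustrative choice. In the crawl sketch above, it would replace the direct setText call.

// Hypothetical update check: decide what to do with a freshly downloaded
// source instead of silently overwriting the stored backup copy.
def updatePage = { File stored, String freshSource ->
    if (!stored.exists()) {
        stored.setText(freshSource, 'UTF-8')     // first download: just save
        return 'NEW'
    }
    def old = stored.getText('UTF-8')
    if (old == freshSource) return 'UNCHANGED'
    // Flag large shrinkage for manual review rather than overwriting,
    // which keeps updates controllable and helps spot possible vandalism.
    if (freshSource.length() < old.length() * 0.5) return 'SUSPICIOUS'
    stored.setText(freshSource, 'UTF-8')
    return 'UPDATED'
}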