Welcome to MediaCAT

MediaCAT is an open-source, web-based application with a curated search engine. It crawls designated news websites and Twitter accounts for citations of, or hyperlinks to, a list of source sites. MediaCAT then archives all referring stories and source stories in preparation for an advanced analysis of relations across the digital news-scape.

MediaCAT is being developed as part of an anthropological research project on the global impact of Israeli online news sites in English.

MediaCAT started as a collaboration between an anthropologist, a computer scientist, and digital librarians. Its first phase of development took place in Fall 2014 as the course project in Dr. Anya Tafliovich’s third-year software engineering course. Since then, it has been developed entirely by computer science and anthropology undergraduate students from UTSC, with a small seed grant from the Vice-Principal of Research Impact Fund.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for details.

MediaCAT's current features include:

  • Django-based web interface
  • Crawls both RSS Feeds and Entire Websites
  • Crawler Implemented in Python
  • Crawl finds both relevant hyperlinks and mentions
  • Archiving extracts author, date, title, anchor text
  • Data stored using MySQL
  • Archives in WARC, PDF, and Image Files
  • User interface allows source and referring site entry, aliases, and tagging
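
As a rough illustration of the "hyperlinks and mentions" feature above, the following is a minimal sketch written for this page, not MediaCAT's actual crawler code. It assumes a referring page is already available as an HTML string, and it uses only the Python standard library. The names SourceLinkFinder and find_citations, and the example domain and alias, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class SourceLinkFinder(HTMLParser):
    """Collect (href, anchor text) pairs for links that point at source sites."""

    def __init__(self, source_domains):
        super().__init__()
        self.source_domains = {d.lower() for d in source_domains}
        self.matches = []      # (href, anchor text) for each matching link
        self.page_text = []    # all visible text, used for mention search
        self._href = None      # href of the matching <a> currently open
        self._anchor = []

    def _is_source(self, href):
        # Match the domain itself or any subdomain (e.g. "www.").
        host = (urlparse(href).hostname or "").lower()
        return any(host == d or host.endswith("." + d)
                   for d in self.source_domains)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self._is_source(href):
                self._href = href
                self._anchor = []

    def handle_data(self, data):
        self.page_text.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.matches.append((self._href, "".join(self._anchor).strip()))
            self._href = None


def find_citations(html, source_domains, aliases):
    """Return (links, mentions) found in one referring page.

    links    -- list of (href, anchor text) pointing at a source domain
    mentions -- aliases that appear in the page text (case-insensitive)
    """
    finder = SourceLinkFinder(source_domains)
    finder.feed(html)
    finder.close()
    text = "".join(finder.page_text).lower()
    mentions = [a for a in aliases if a.lower() in text]
    return finder.matches, mentions
```

A call such as find_citations(page_html, ["example-source.com"], ["Example Source"]) would return the matching links with their anchor text, plus the aliases mentioned in the page. A production crawler would also need to handle relative URLs, redirects, and text inside scripts or styles, which this sketch ignores.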

2016 Code4Lib Conference

Curate my web crawl: Building a multiprocessing web crawler for ethnographic research

11th International Digital Curation Conference

Building on Digital Projects: Managing a Digital Course Infrastructure

Archives Unleashed: Web Archive Hackathon, 2016

MediaCat

Our open-source project is on GitHub.