Documentation
The system is based on my diploma thesis and is described there in detail. There has also been a final presentation which gives an overview of the system. Nevertheless, the presented system has changed over time: some old features are gone and new features have been added. The following sections give an overview of how to install and use the Monkey-Spider system.
Software requirements:
Besides a Linux operating system with the standard tools sed, awk, wget, grep and unzip, and Python 2.4 or higher, the following packages are necessary for the Monkey-Spider to operate properly.
- The Heritrix web crawler (both versions 1.x and 2.x) http://crawler.archive.org
- A PostgreSQL database server reachable over the network with password authentication http://postgresql.org
- The ClamAV anti-virus scanner http://www.clamav.net
- PyGreSQL, the Python interface to the PostgreSQL database http://www.pygresql.org
- SOAPpy to access the Microsoft Live Search Web service http://pywebsvcs.sourceforge.net
- pYsearch to access the Yahoo! Search Web service http://pysearch.sourceforge.net
Hardware requirements:
Depending on the scope of the analysis, the hardware requirements range from a Pentium-class PC with 128 MB of RAM to a server farm with dozens of high-performance multicore systems. The bottlenecks for proper operation of the system are network bandwidth, hard disk space and memory.
Installation:
Monkey-Spider uses a PostgreSQL database to store its results. The database does not have to be installed on the same machine; a remote server can be used as well.
First, the required software has to be installed and accessible from the system PATH.
Then Monkey-Spider has to be installed with the install.sh script as root (run uninstall.sh to remove Monkey-Spider).
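For example, from the unpacked Monkey-Spider source directory (assuming the install script resides there and sudo is available):
sudo ./install.sh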
Finally, the database schema for the malware database provided with Monkey-Spider has to be loaded into the database with a command like
psql -f mw-db-scheme.sql malwaredb
if the malware database is called malwaredb.
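If the malware database does not exist yet, it can be created beforehand with the standard PostgreSQL tool (assuming sufficient privileges on the database server):
createdb malwaredb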
Usage:
Monkey-Spider consists of several scripts which can be executed independently depending on the research focus of the user.
A normal run consists of three steps:
- Seed generation: Prior to crawling, we have to decide which portion of the Web we want to crawl.
- Crawl setup and crawling: Before we can queue the URLs found in step one, we have to decide how we want to crawl them.
- Scanning: After everything has been crawled, every item is scanned with our detection mechanism, i.e. ClamAV for now.
The configuration file for the Monkey-Spider scripts is /etc/monkey-spider.conf. It holds the access details for the malware database and some other optional configuration values.
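The exact configuration keys depend on the installed version; a purely illustrative example with hypothetical key names and placeholder values could look like this:
# /etc/monkey-spider.conf -- illustrative values only, key names are assumptions
db_host = localhost
db_port = 5432
db_name = malwaredb
db_user = monkeyspider
db_password = secret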
Step 1 Seeding:
The Heritrix crawler starts crawling with a plain text file called seeds.txt inside the standard crawl profile. There are four different methods to generate starting seeds for the crawler:
- Manual URL addition: URL entries can be added manually during the crawl configuration or directly to the seeds.txt file if we want to analyze a known predefined set of Web sites.
- Web search based seeding: Monkey-Spider provides two scripts, ms-seeder-websearch-yahoo and ms-seeder-websearch-livesearch, which use the Web services of Yahoo! Search and Microsoft Live Search to gather real-time search results as URL lists, provided we supply a valid Application ID for the respective service. These Application IDs are unique per Yahoo! or Microsoft Live user and have to be requested on the corresponding Web service sites.
- Blacklist seeding: Monkey-Spider provides a script, ms-seeder-blacklist, to automatically gather a list of known blacklisted URLs. Such URLs are often advertisement related, but many malicious URLs are included as well. This can be a good starting point for general research on this category of Web sites.
- Mail seeding: Monkey-Spider also provides a script, ms-seeder-mail-pop3, to examine an email account for mails containing URLs. This account could, for example, be a spamtrap, which would lead us to up-to-date malicious URLs. Email accounts with POP3 support are supported.
The output of the desired seeding scripts can be combined manually and copied to the seeds.txt file inside the default crawl profile in Heritrix, as shown in the example below.
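A purely illustrative seeding session could look like the following; the exact command-line arguments of the seeder scripts (such as how the Application ID and search terms are passed) and the location of the seeds.txt file are assumptions and may differ:
ms-seeder-websearch-yahoo YAHOO_APP_ID "free screensaver" > seeds-yahoo.txt
ms-seeder-blacklist > seeds-blacklist.txt
cat seeds-yahoo.txt seeds-blacklist.txt | sort -u >> /path/to/heritrix/profile/seeds.txt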
Step 2 Crawling:
Before starting to crawl, the crawler has to be configured properly to achieve the intended results. Heritrix has its own Web interface, in which the crawl scope, the timing, the robots.txt honoring policy, the user agent string and the like have to be configured and fine-tuned carefully. The configuration details for Heritrix can be found in the Heritrix user manual. After successful configuration, the crawl job has to be started and the crawler has to be in the running state. The crawler will then start to produce output in the corresponding jobs/ folder; the crawled content is placed in the arcs/ folder inside it. Before going on, the arcs/ folder has to contain at least one file ending with '.arc.gz', which is the sign of a completed ARC file.
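For example, assuming a job named my-crawl in the Heritrix jobs/ directory (the actual job folder name will differ), the presence of completed ARC files can be checked with:
ls jobs/my-crawl/arcs/*.arc.gz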
Step 3 Scanning:
If there is at least one file ending with '.arc.gz' in the arcs/ folder, we can execute the script ms-processfolder to scan all finished ARC files for malware using ClamAV. All found items will be stored in an additional directory called attic/, and every found item will be committed to the database.
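A hypothetical invocation, assuming ms-processfolder takes the folder containing the ARC files as its argument (the exact arguments may differ in the installed version):
ms-processfolder jobs/my-crawl/arcs/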