Security :: Cornell Spider - Linux


Spider is an open source network forensics tool developed at Cornell University to identify the presence of sensitive information on a computer or attached storage device.

It is designed to recursively process a mounted volume, searching files for a limited set of regular expressions. Because of the need to protect that specific data and because the numeric format lends itself well to accurate matching, the default regular expressions focus on social security numbers and credit card numbers. Other data, such as driver's license numbers, dates of birth, telephone numbers, and the like are frequently present on disk but are not in and of themselves protected data. Spider will generate false positives, so users should make every effort to verify Spider's results before moving, encrypting, or deleting files. Spider's logs can function as a roadmap to confidential data and must be stored securely.
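As a rough illustration of the pattern matching described above, the following Python sketch searches text for strings shaped like social security numbers. The pattern and the names SSN_PATTERN and find_ssn_candidates are assumptions made for this example; the expressions in Spider's own REGEXES file may differ.

```python
import re

# Hypothetical pattern for illustration only; not taken from Spider's REGEXES file.
# The lookarounds stop the pattern from matching inside a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_candidates(text):
    """Return every substring of text shaped like NNN-NN-NNNN."""
    return SSN_PATTERN.findall(text)


sample = "Employee 123-45-6789 ordered part 987-65-4321 on 2004-01-15."
candidates = find_ssn_candidates(sample)
print(candidates)
```

Both the employee number and the part number match, although only one of them could be a social security number. This is the kind of false positive that makes manual verification necessary.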

Spider has four basic components, found in /usr/local/cornell/spider in Cornell's Helix distro:

  • the Spider collector process, spider_server, which receives matches from Spider clients and records them to its log file
  • the Spider client, responsible for processing files
  • spider.conf: the configuration file used by both the client and server. It contains the shared secret, path to the log file written by spider_server, path to the regular expressions file, and a few other configuration parameters
  • REGEXES: the list of regular expressions to use

The default configuration encrypts all traffic between the Spider client and server and uses an MD5 hash of each packet to verify that the data is received intact. Spider_server listens on UDP port 3000 for incoming client messages, processes them, and writes them to the log file.
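The integrity check described above can be sketched in Python as follows. This is a minimal illustration that assumes the digest is simply prepended to the payload; Spider's actual packet layout is not documented here, encryption is omitted, and the names seal and unseal are hypothetical.

```python
import hashlib

DIGEST_SIZE = hashlib.md5().digest_size  # 16 bytes for MD5


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its MD5 digest (assumed layout: digest + payload)."""
    return hashlib.md5(payload).digest() + payload


def unseal(packet: bytes) -> bytes:
    """Split a packet into digest and payload; raise ValueError on a mismatch."""
    digest, payload = packet[:DIGEST_SIZE], packet[DIGEST_SIZE:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("integrity check failed")
    return payload


packet = seal(b"match found in /path/to/file")
print(unseal(packet))
```

A receiver built this way rejects any packet whose payload no longer matches its prefix, which is how a damaged or wrongly decrypted message would be noticed.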

What You Need

Perl 5.6.1 or higher with the following modules:

  • MD5
  • SHA1
  • Crypt::CBC and Crypt::Blowfish
  • Config::IniFiles
  • File::LibMagic
  • File::Path
  • Archive::Zip

Any or all of the following auxiliary programs:

  • file
  • wvText (for converting Word docs to text)
  • unzip
  • unrar
  • lha
  • unzoo
  • arj
  • readpst

The software can be downloaded at the Cornell University website.

Getting Started

It is only necessary to start the server once, either on the Helix instance used for examination or on a central Spider log host:


Spider_server will print its configuration then background itself.

Security Notes

Spider uses an encryption and integrity checking mechanism intended to prevent eavesdropping on results sent over the local network. The protocol should be sufficiently sound to prevent disclosure in a low threat environment. Captured Spider sessions can be decrypted after the fact if the shared secret itself is reused or disclosed, so every effort should be made to protect Spider's key when it's used outside the local machine.

Spider serves to select and concentrate sensitive data in the log file. That file exists in a RAM disk under Helix, but may not when operating under other systems. The log file should be securely recorded to removable media and stored safely. Once this is done, the log file should be wiped with a disk wiping utility of known and reliable behavior.


Normal Spider behavior is to recursively scan directories and process all readable files. Files of certain types, including binary graphics data, RPMs and executables, are not processed as they yield very high false positive rates. Every effort is made to deal with Unicode, Office file formats, PDFs, mailboxes, and the like. Spider_server's log file begins with "GIF89a" deliberately to cause Spider to skip processing that file, should circumstances dictate it be in the examination tree.
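The "GIF89a" trick above works because file type detection typically looks at the first few bytes of a file. The Python sketch below shows the idea under simplifying assumptions: Spider itself relies on File::LibMagic and the file utility, whereas this example uses a short, hypothetical list of signatures and a hypothetical should_skip function.

```python
import os
import tempfile

# Hypothetical signature list for illustration; real detection covers far more types.
SKIP_SIGNATURES = (b"GIF89a", b"GIF87a", b"\x89PNG\r\n\x1a\n", b"\x7fELF")


def should_skip(path):
    """Return True if the file begins with one of the known binary signatures."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    return head.startswith(SKIP_SIGNATURES)


# Demonstration: a log-like file that starts with "GIF89a" is skipped.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "logfile")
    text_path = os.path.join(workdir, "notes.txt")
    with open(log_path, "wb") as handle:
        handle.write(b"GIF89a\nrecorded matches follow\n")
    with open(text_path, "wb") as handle:
        handle.write(b"plain text notes\n")
    log_skipped = should_skip(log_path)
    text_skipped = should_skip(text_path)

print(log_skipped, text_skipped)
```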

Default arguments to Spider are generally sufficient:

/usr/local/cornell/spider/ -D /path/to/mount

Spider can be fairly verbose, and will report files under examination.

The Spider log file

The Spider log file records the IP address of the Spider client making the report (localhost in most cases), the path to the file being examined, the regular expression that caused the match, and roughly 1K of human-readable text in which the match was found.

False positives are not uncommon, as many strings are likely to match SSN or credit card account number patterns. Visual inspection of the Spider log file is necessary as a final step in order to determine whether or not sensitive data is present and to what degree.
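One common way to triage credit card candidates during such a review is the Luhn checksum, which most payment card numbers satisfy. The source does not say whether Spider applies this check itself, so the Python sketch below is offered only as a reviewer's aid; the function name luhn_valid and the minimum length of 13 digits are assumptions of this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # assumed minimum length for a card number
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A string that fails the checksum is very unlikely to be a real card number, but a string that passes still needs to be inspected in context.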

Spider_server command line options

-c <file>

read configuration from <file> instead of /usr/local/cornell/spider/spider.conf

-l <log>

write to <log>; path in spider.conf supersedes this option

-r <regexes>

read regular expressions from <regexes>; path in spider.conf supersedes this option


[verbose] will cause Spider to be excessively chatty in its operation. It also overrides Spider's default behavior of reporting only the first match in a file


[unscrew] will cause Spider to recursively process directories, converting Windows filename conventions to UNIX filename conventions. This is NOT forensically sound as it requires modifying the evidence drive, can be extremely destructive to production systems, and makes post-examination followup difficult. It is also unnecessary 99% of the time with Spider 4.0


[test] will cause Spider to compile the regular expressions in the file specified by spider.conf and report any syntactical errors

-D <dir>

will cause Spider to begin processing at directory <dir>


[show] will print the regular expressions in the file specified in spider.conf

-c <config>

will cause Spider to use <config> as its configuration file instead of /usr/local/cornell/spider/spider.conf
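The [test] option described above compiles each regular expression and reports syntax errors. The same idea can be sketched in Python as follows. This assumes one pattern per line, which may not match the real REGEXES file layout, and Python's regular expression syntax is not identical to Perl's, so the sketch illustrates the approach rather than replacing Spider's own check. The function name check_patterns is hypothetical.

```python
import re


def check_patterns(lines):
    """Return (line_number, pattern, error) for each pattern that fails to compile."""
    problems = []
    for number, line in enumerate(lines, start=1):
        pattern = line.strip()
        if not pattern:
            continue  # ignore blank lines
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((number, pattern, str(exc)))
    return problems


# Hypothetical patterns; the third has an unbalanced parenthesis.
patterns = [
    r"\d{3}-\d{2}-\d{4}",
    r"(\d{4}[ -]?){3}\d{4}",
    r"(\d{3}-\d{2}",
]
problems = check_patterns(patterns)
for number, pattern, error in problems:
    print(f"line {number}: {pattern!r}: {error}")
```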

Spider.conf options

The spider.conf configuration file consists of a series of keyword/value pairs delimited by equals ("=") characters. A pound ("#") sign anywhere on a line begins a comment, and the rest of that line is ignored. Keywords are case insensitive.
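The format described above can be read with a few lines of Python. The sketch below follows only the rules stated here (keyword/value pairs, "#" comments, case-insensitive keywords); Spider itself lists Config::IniFiles among its required modules, so the real file may have additional structure. The parse_conf function and the sample values are illustrative assumptions.

```python
def parse_conf(text):
    """Parse keyword=value lines into a dict with lowercased keywords."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop everything after "#"
        if not line or "=" not in line:
            continue
        keyword, value = line.split("=", 1)
        settings[keyword.strip().lower()] = value.strip()
    return settings


# Illustrative sample; values mirror the defaults documented for these options.
SAMPLE = """\
# sample configuration for illustration only
Port = 3000          # documented default
LOGHOST = localhost
encrypt = 1
Cipher = Blowfish
"""

conf = parse_conf(SAMPLE)
print(conf)
```

Lowercasing the keywords on the way in means later lookups do not have to care how the user capitalized them.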

Definable options are as follows:

Logfile (path) path to spider_server's record of incoming pattern matches; defaults to /tmp/logfile.IP where "IP" will be replaced by the Spider client IP address

Regexes (path) path to Spider's regular expressions file. Be aware that overzealous regular expressions slow Spider considerably and result in higher false positive rates.

Use_hmac (0|1) determines whether Spider will prefix each packet with a hash of the payload (MD5 or SHA1, as selected by the Hmac option). Default is 1, which allows Spider to detect decryption errors or tampering.

Hmac (md5|sha1) determines which hashing algorithm will be used for integrity checking

Encrypt (0|1) determines whether the communications between Spider client and server are encrypted. Defaults to 1. As Spider will, by design, selectively discover and concentrate sensitive data, encrypted communications are extremely desirable.

Cipher (Blowfish|DES) selects the cipher to use of those available to the Crypt::CBC perl module. Blowfish is the default.

Key (user defined) gives the encryption key used by both Spider client and spider_server for secure communications. Ideally, this is uniquely defined for each Spider use in non-loopback communications and never reused. No effort is made to periodically change keying material and the communication of keying material to Spider clients is assumed to be the security responsibility of the user.

Interface (user defined, default gives the interface on which spider_server should listen for connections. Giving the localhost interface prevents outside clients from accessing spider_server

Port (user defined, default 3000) gives the port on which Spider client and server communicate. Protocol is UDP and max payload size is 1024 bytes plus any overhead incurred by HMAC

Loghost (user defined, default localhost) is the log server to which Spider clients should send their results.

Summary (0|1) determines whether to send a summary of file types to the loghost. Can be used to keep server logs brief.

Types (number) reports only on the top <number> file types found.

Max_depth (bytes) scans files only to a depth of <bytes>. Generally results in a speed improvement if kept less than 20000.

Unprint (quoted char) replaces unprintable characters with <char>. Defaults to ".".

Unpack (0|1) determines whether archives are unpacked in /tmp and the results scanned. Can slow Spider but does result in better discovery.
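The effect of the Max_depth and Unprint options can be illustrated with a short Python sketch. It assumes "unprintable" means any byte outside printable ASCII, tab, newline, and carriage return; Spider's actual rule is not stated here, and the function name readable_excerpt is hypothetical.

```python
def readable_excerpt(data: bytes, max_depth: int, unprint: str = ".") -> str:
    """Keep the first max_depth bytes and replace unprintable bytes with unprint."""
    window = data[:max_depth]  # bytes beyond max_depth are never examined
    return "".join(
        chr(byte) if 32 <= byte < 127 or byte in (9, 10, 13) else unprint
        for byte in window
    )


excerpt = readable_excerpt(b"SSN\x00123-45-6789\xff", max_depth=8)
print(excerpt)
```

Limiting the window trades completeness for speed: anything past the cutoff is never scanned, so a match located deeper in the file would be missed.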


Thanks to Cornell University for providing this tool. The most recent information concerning Cornell Spider for Linux may be found at
