Developing a mailbot script

Address Catcher

Lead Image © Konstantin Inozemtcev, 123RF.com

Article from Issue 292/2025

A Python script that captures email addresses will help you understand how bots analyze and extract data from the web.

Bots crawl around constantly on the Internet, capturing information from public websites for later processing. Although the science of bot design has become quite advanced, the basic steps for capturing data from an HTML page are quite simple. This article describes an example script that extracts email addresses. The script even provides the option to extend the search to the URLs found on the target page. Rolling your own bot will help you build a deeper understanding of privacy defense and cybersecurity.

Setting Up the Environment

For Python programming, I recommend setting up an integrated development environment (IDE) such as Visual Studio (VS) Code; a basic understanding of the language is also helpful. You can download VS Code from the VS Code website [1]. On Ubuntu, an easy way to install the application is by downloading the .deb package, right-clicking the file, and selecting the Install option. Alternatively, you can search for "vscode" in the App Center and click the Install button. If you prefer using the terminal, the VS Code website [2] provides detailed instructions for any Linux distribution. I also suggest adding Python development extensions, including Pylance and the Python Debugger.

The Script

The full text of the mailbot.py script is available on the Linux Magazine website [3]. Listing 1 shows the beginning of the script where I import the modules I will need to manage communications via the HTTP protocol, search for string patterns using regular expressions, implement asynchronous functions, manage script input arguments, and show a progress bar to track process advancement. The alive_progress module is not part of the standard library, so I have to install it with the following command:

pip install alive-progress

Listing 1

Importing Modules

import urllib
import urllib.request
import re
import asyncio
import argparse
from alive_progress import alive_bar
import sys

Listing 2 first defines the asynchronous function ExtractURLs(myUrl), which takes the text to be scanned (the page source) as a parameter and returns a list of the URLs it contains. If the user has requested recursive scanning, the script later analyzes each entry of this list in search of email addresses. The elements to be added to the list are selected and extracted using a regular expression, defined in the variable regex and used as a parameter in the findall function. The findall function returns all occurrences that match the format I have set. I store these occurrences in the list t, which is finally returned as the result of the ExtractURLs(myUrl) function.

Listing 2

Defining Asynchronous Functions

async def ExtractURLs(myUrl):
  try:
    t=[]
    regex="(?P<url>https?://[^\\s'\"]+)"
    t=re.findall(regex, myUrl)
  except:
    pass
  finally:
    return t

async def ExtractMails(myUrl):
  try:
    u=[]
    regex="[\\w\\.-]+@[\\w\\.-]+\\.\\w+"
    u=re.findall(regex, myUrl)
  except:
    pass
  finally:
    return u

Similarly, I then define the asynchronous function ExtractMails(myUrl), which returns a list of email addresses found in the string myUrl, based on another regular expression that I have defined. Both functions described above are enclosed in a try, except, finally construct. In case of an error, the script does not perform any operations, in order to avoid premature termination or producing a final output in a nonstandard and thus unusable format. Regardless of the outcome, both functions return a list, containing web addresses and email addresses, respectively. The regular expressions used for extracting URLs and email addresses have been adapted from discussions on the Stack Overflow forum [4, 5]. I could refine ExtractURLs(myUrl) and ExtractMails(myUrl) to ensure they return a list strictly populated with valid values; however, I prefer to prioritize sensitivity over specificity in the extraction process. In other words, I choose to return a broader list of addresses that includes all available ones, rather than a shorter list of likely valid addresses that may exclude some equally valid entries. The rationale behind this choice is based on the fact that sending emails is a quick and low-cost operation. This approach allows me to maximize the number of users reached at the cost of a small percentage of undelivered emails.
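
To see how the two regular expressions behave in practice, consider the following short, self-contained sketch; the HTML fragment and the addresses in it are invented purely for illustration:

import re

# Invented fragment of page source, used only to demonstrate the two patterns
sample = ('<a href="https://www.example.org/contact">Contact</a> '
          'Write to info@example.org or sales@example.org. '
          '<img src="logo@2x.png">')

url_regex = "(?P<url>https?://[^\\s'\"]+)"
mail_regex = "[\\w\\.-]+@[\\w\\.-]+\\.\\w+"

print(re.findall(url_regex, sample))   # ['https://www.example.org/contact']
print(re.findall(mail_regex, sample))  # ['info@example.org', 'sales@example.org', 'logo@2x.png']

Note that a string such as logo@2x.png also matches the email pattern; accepting such false positives is the price of the sensitivity-first approach described above.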

In Listing 3 I process the script parameters, two of which are mandatory: the web address to analyze, url, and the name of the output file, output. The third parameter, -r, is optional and extends the search for email addresses to the URLs contained within the web page defined by the first parameter. I implement only one level of recursion, because the number of addresses involved in the search (and consequently the required resources) would grow exponentially with each iteration. Next, I open a connection to the web page specified in the url argument and assign its source code to the variable webContent.

Listing 3

Processing Script Parameters

try:
  parser=argparse.ArgumentParser()
  parser.add_argument("url", help="Url to analyze", type=str)
  parser.add_argument("output", help="Output file", type=str)
  parser.add_argument("-r", "--recursive", help="Recursive search", action="store_true")
  args=parser.parse_args()

  response = urllib.request.urlopen(args.url)
  webContent = response.read().decode('UTF-8')

Listing 4 scans the webContent string I just obtained for email addresses by passing it to the ExtractMails(myUrl) function. At this point, I check whether the user has requested a recursive search. If not, the operation is completed, and I print the list variable z (where I have stored the found email addresses) to the output file. Otherwise, I perform a search for all URLs contained in the webContent string and, for each of them, I call the ExtractMails(myUrl) function again. I perform the searches using asynchronous functions to ensure that the email extraction process occurs in an orderly fashion and only after the URL search has been completed.

Specifically, the list v contains all the URLs extracted from webContent as its elements. I loop through the list v, calling the ExtractMails(myUrl) function for each element. The variable k contains the email addresses extracted from the current element of the v list. If the length of k is greater than zero (i.e., if at least one address is found), I append its contents to the list z, which holds all the email addresses extracted so far. At the same time, I provide the user with a visual progress indicator: a real-time progress bar that ranges from zero to the length of the v list, which corresponds to the number of URLs to analyze. The current value of the bar is equal to i, a counter that tracks the iterations through the URL list, and the bar is updated at each iteration by calling the bar() function.

Listing 4

Scanning Email Addresses

  z=asyncio.run(ExtractMails(webContent))

  if(args.recursive==True):
    v=asyncio.run(ExtractURLs(webContent))
    with alive_bar(len(v)) as bar:
      for i in range(len(v)):
        k=asyncio.run(ExtractMails(v[i]))
        if(len(k)!=0):
          z.extend(k)
        bar()

  if(len(z)>0):
    z[-1] = z[-1] + ";"
    with open(args.output, "w") as f:
      print(*z, sep="; ", file=f)
    print("Found " + str(len(z)) + " mail addresses.")
  else:
    print("No mail addresses were found.")
except Exception as e:
  print(e)

I check if any email addresses have been extracted and, if not, notify the user with the message "No mail addresses were found" (Figure 1). If the check is positive, I append a semicolon to the last element of the z list. Next, I open the file specified in the args.output argument and write all the elements of the z list to it, each separated by a semicolon and a space. Finally, I inform the user about the number of email addresses found. The output file will thus contain a sequence of addresses already properly formatted for use as a recipient list for an email.
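
To make the output format concrete, the following minimal sketch (with invented addresses and the hypothetical file name output.txt) reproduces the write operation from Listing 4:

# Invented addresses, used only to illustrate the format written by Listing 4
z = ["alice@example.org", "bob@example.org", "carol@example.org"]
z[-1] = z[-1] + ";"  # terminate the last entry with a semicolon

with open("output.txt", "w") as f:
    print(*z, sep="; ", file=f)

# output.txt now contains:
# alice@example.org; bob@example.org; carol@example.org;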

Figure 1: If no mail addresses are extracted, the mailbot issues a "No mail addresses were found" message.

Use Case Example

The script should be executed with the following syntax:

python mailbot.py website_name output_file [-r]

Instead of website_name, you should insert the full address of the page, including the "http://" or "https://" prefix. output_file refers to the text file where you want to save the list of found email addresses. Finally, the optional parameter -r enables a recursive search. Therefore, to recursively extract email addresses from www.mysite.org and save the corresponding list to output.txt, you just need to type the following:

python mailbot.py https://www.mysite.org output.txt -r

A progress bar indicates the real-time progress of the search process. Figure 2 shows the appearance of the terminal while the operation is in progress. If the -r parameter is not specified, the search is performed only on the page provided as the first parameter of the script. The output file will contain a list of extracted email addresses, separated by a semicolon and a space. This way, the resulting string can be used directly as a recipient list without further processing.
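
As a hypothetical follow-up, the saved file could be read back and split into individual addresses again; the file name and the addresses shown are just examples:

# Read the hypothetical output.txt back and split it into single addresses
with open("output.txt") as f:
    recipients = [a.strip(" ;\n") for a in f.read().split(";") if a.strip(" ;\n")]

print(recipients)  # e.g. ['alice@example.org', 'bob@example.org', 'carol@example.org']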

Figure 2: Mailbot script at work in the terminal.
