Class 4

Advanced ETL and Textual Analytics 1

Monday, October 6, 2025

Class Overview

Why is this important?

Mastering advanced ETL processes and textual analysis techniques is vital for accounting professionals who work with unstructured or semi-structured data, such as emails or transaction logs. The skills gained in this class prepare students to handle complex data challenges commonly encountered in auditing and forensic accounting. Specifically, working with real-world datasets like the Enron Email Corpus strengthens students' ability to analyze large volumes of data, uncover patterns, and derive actionable insights, all of which are critical for risk assessment and fraud detection. These techniques also make students more efficient at managing large datasets and give them a competitive edge in data-driven roles.

What will we do?

In this class, students will use advanced ETL and textual analysis methods to work with real-world email data from the Enron Email Corpus. The class will introduce more sophisticated parsing tools, building on foundational ETL knowledge and using complex regular expressions to extract, clean, and restructure data from unstructured text. Through case studies and hands-on activities, students will learn how to handle more intricate data challenges, especially those that arise with large-scale email datasets. The session will deepen their understanding of how to transform messy textual data into structured formats for further analysis.
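To make the idea of turning unstructured email text into a structured record concrete, here is a minimal sketch in Python. The class itself uses the Alteryx RegEx parsing tools, so this is only an illustration of the same idea in a different tool. The sample message, the `parse_message` function, and the choice of header fields are all assumptions made for the example and are not taken from the case or the corpus.

```python
import re

# Hypothetical sample message, invented for illustration (not from the corpus).
RAW_MESSAGE = """Date: Mon, 14 May 2001 16:39:00 -0700
From: alice.smith@example.com
To: bob.jones@example.com, carol.lee@example.com
Subject: Quarterly forecast

Please review the attached numbers before Friday.
"""

# Match "Name: value" header lines for a few fields we assume are of interest.
HEADER_PATTERN = re.compile(r"^(Date|From|To|Subject):[ \t]*(.*)$", re.MULTILINE)


def parse_message(raw):
    """Split a raw message into a dictionary of header fields plus the body."""
    # Headers and body are separated by the first blank line.
    header_text, _, body = raw.partition("\n\n")
    record = {
        name.lower(): value.strip()
        for name, value in HEADER_PATTERN.findall(header_text)
    }
    # Restructure the comma-separated recipient string into a list.
    record["to"] = [addr.strip() for addr in record.get("to", "").split(",") if addr.strip()]
    record["body"] = body.strip()
    return record


record = parse_message(RAW_MESSAGE)
print(record["from"])
print(record["to"])
print(record["subject"])
```

Each extracted field becomes a column in a structured table, which is roughly what a parsing tool produces when it is given one capture group per output field.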

How this relates to other classes:

This class builds directly on the previous session, where students were introduced to basic ETL processes and the use of regular expressions to clean raw data. While the last class focused on foundational techniques, this class moves into more advanced applications, using real-world data to tackle more complex parsing challenges. By working with the Enron Email Corpus, students will apply their knowledge of regular expressions to extract relevant information from a larger and messier dataset, further honing the skills learned in the previous class on data preparation and automation.

Materials and Preparation

Class Materials
  • Case: Innovation_mindset_case_studies_Cybersecurity_Audit_Enron_Emails
  • Slides: PowerPoint or PDF
  • Data:
  • Analytics Tools: Advanced Alteryx parsing tools: RegEx parsing tools, Filter.
  • Suggested Pre-Class Preparation
    1. Important: Before class, download the extra slides for Class03 from the Class03 page to understand the issues with the Power BI and Tableau joins and how to fix them (or look for videos in an announcement).
    2. You are not required to pre-read the case. However, the case provides background that we won't cover in depth in class, so reading it in advance will help you use it as a reference when solving its components.
    3. If you do read the case in advance, note that we will be working on a modified deliverable for this case that includes explanations and summary findings for specific tasks that I will give you in class.
  • Class Plan
    1. We will start with a review and Poll Everywhere questions on the joins material to identify any weaknesses in understanding these concepts.
    2. We will work through the joins case in Alteryx, making sure we address the issues discovered with the joins case in Power BI and Tableau.
    3. Rather than review RegEx, we will work primarily in the labs with Alteryx/RegEx to explore the Enron Email data set.
    4. The main goal for this class is to become more comfortable with Advanced ETL (parsing tools), including using regular expressions (RegEx) to extract specific data.
    5. After walking through the solution for extracting the emails, the class will be open-ended, with a search for suspicious emails.
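
As a rough illustration of the last two steps of the plan, the sketch below extracts email addresses with a regular expression and then filters messages on a list of watch terms. It is written in Python rather than Alteryx, and it is not the in-class solution. The sample messages, the `WATCH_TERMS` list, and the function names are hypothetical, and the address pattern is a deliberately simple approximation that will not match every valid address.

```python
import re

# Simplified address pattern (an approximation, not a full address grammar).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical watch terms, chosen only for this example.
WATCH_TERMS = ["off the books", "shred", "delete this"]
TERM_PATTERN = re.compile("|".join(re.escape(t) for t in WATCH_TERMS), re.IGNORECASE)

# Hypothetical messages, invented for illustration (not from the corpus).
messages = [
    {"id": 1, "text": "From: a@example.com To: b@example.com Please shred the drafts."},
    {"id": 2, "text": "From: c@example.com Lunch at noon?"},
]


def extract_addresses(text):
    """Return the unique email addresses found in the text, sorted."""
    return sorted(set(EMAIL_PATTERN.findall(text)))


def flag_messages(items):
    """Return the ids of messages containing any watch term (case-insensitive)."""
    return [m["id"] for m in items if TERM_PATTERN.search(m["text"])]


addresses = extract_addresses(messages[0]["text"])
flagged = flag_messages(messages)
print(addresses)
print(flagged)
```

A keyword filter like this only narrows the set of messages to read. Deciding whether a flagged email is actually suspicious still requires judgment about context.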