Class 8

Advanced ETL and Textual Analytics 2

Monday, October 21, 2024

Class Overview

In this class, students will continue to engage with advanced ETL and Textual Analysis methods to work with real-world email data from the Enron Email Corpus. The class will introduce more sophisticated parsing tools, building upon foundational ETL knowledge while incorporating complex regular expressions to extract, clean, and restructure data from unstructured text. Through case studies and hands-on activities, students will learn how to handle more intricate data challenges, especially those encountered when dealing with large-scale email datasets. The session will deepen their understanding of transforming messy textual data into structured formats for further analysis.

Why is this important?
Mastering advanced ETL processes and textual analysis techniques is vital for accounting professionals working with unstructured or semi-structured data, such as emails or transaction logs. The skills gained in this class prepare students to handle complex data challenges commonly encountered in auditing and forensic accounting. Specifically, working with real-world datasets like the Enron Email Corpus enhances their ability to analyze vast amounts of data, uncover patterns, and derive actionable insights, critical for risk assessment and fraud detection. These advanced techniques also improve their efficiency in managing large datasets and provide a competitive edge in data-driven roles.

Class Materials and Details

Materials:

Case: Enron Email Case Study
Slides: Available for download by the beginning of class in either PowerPoint or PDF format.
Data: A data update may be required for this class. To ensure your files are up to date, navigate to the ACCTG522_Labs folder and run the command git pull.
Analytics Tools: We will use Alteryx advanced parsing tools in this class.

Review and Extension:
This class builds directly on Advanced ETL and Textual Analytics 1, where we began working with regular expressions. Here we examine more advanced applications, using real-world data to tackle more complex parsing challenges. By continuing to work with the Enron Email Corpus, students will apply their knowledge of regular expressions to extract relevant information from a larger and messier dataset, further honing the data preparation and automation skills learned in the previous class.

Preparation:
  1. The background reading and case can either be read in advance or used as a reference.
  2. Pre-reading the Forensics Case isn't strictly necessary, but it will make the work we do easier.

Class Plan:
Teams: During this class, please sit in your discussion teams.
  1. We will work primarily in the labs to explore the Enron Email data set. Update your ACCTG522_Labs folder using git pull.
  2. The main goal for this class is to become more comfortable with Advanced ETL (parsing tools) including using regular expressions (RegEx).
  3. After walking through the solution for extracting the emails, the class will be open-ended, with a search for suspicious emails and an examination of email sentiment.
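
For students who want to see the parsing idea outside of Alteryx, the email extraction step in the plan above can be sketched in Python. This is only an illustration, not the class solution: the sample message, the parse_email helper, and both regular expressions are assumptions made up for this example, and real messages in the Enron Email Corpus are messier (folded headers, missing fields, forwarded chains).

```python
import re

# Hypothetical sample message, loosely shaped like an email in the corpus.
RAW_EMAIL = """Message-ID: <12345.example@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice.smith@enron.com
To: bob.jones@enron.com, carol.lee@enron.com
Subject: Q2 forecast

Please review the attached forecast before Friday.
"""

# One header per line: a field name, a colon, then the value.
HEADER_RE = re.compile(r"^(?P<field>[A-Za-z-]+):[ \t]*(?P<value>.*)$", re.MULTILINE)

# Deliberately simple address pattern; it does not cover every valid address.
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_email(raw):
    """Split a raw message into header fields and body (illustrative helper)."""
    # Headers end at the first blank line; everything after it is the body.
    head, _, body = raw.partition("\n\n")
    record = {
        m.group("field").lower(): m.group("value").strip()
        for m in HEADER_RE.finditer(head)
    }
    # Turn the comma-separated recipient string into a list of addresses.
    record["to"] = ADDRESS_RE.findall(record.get("to", ""))
    record["body"] = body.strip()
    return record


parsed = parse_email(RAW_EMAIL)
print(parsed["from"], parsed["to"], parsed["subject"])
```

The same two-stage approach (split headers from body, then apply one pattern per field) maps onto a parsing workflow in general: each regular expression turns one piece of unstructured text into a structured column.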