Class 6

Advanced ETL and Textual Analytics 1

Monday, October 14, 2024

Class Overview

In this class, students will engage with advanced ETL and Textual Analysis methods to work with real-world email data from the Enron Email Corpus. The class will introduce more sophisticated parsing tools, building upon foundational ETL knowledge while incorporating complex regular expressions to extract, clean, and restructure data from unstructured text. Through case studies and hands-on activities, students will learn how to handle more intricate data challenges, especially those encountered when dealing with large-scale email datasets. The session will deepen their understanding of transforming messy textual data into structured formats for further analysis.

Why is this important?
Mastering advanced ETL processes and textual analysis techniques is vital for accounting professionals working with unstructured or semi-structured data, such as emails or transaction logs. The skills gained in this class prepare students to handle complex data challenges commonly encountered in auditing and forensic accounting. Specifically, working with real-world datasets like the Enron Email Corpus enhances their ability to analyze vast amounts of data, uncover patterns, and derive actionable insights, critical for risk assessment and fraud detection. These advanced techniques also improve their efficiency in managing large datasets and provide a competitive edge in data-driven roles.

Class Materials and Details

Materials:

Case: Enron Email Case Study
Slides: Available for download by the beginning of class in either PowerPoint or PDF format.
Data: A data update may be required for this class. To ensure your files are up to date, navigate to the ACCTG522_Labs folder and run the command git pull.
Analytics Tools: Advanced Alteryx parsing tools.

Review and Extension:
This class builds directly on the previous session, where students were introduced to basic ETL processes and the use of regular expressions to clean raw data. While the last class focused on foundational techniques, this class moves into more advanced applications, using real-world data to tackle more complex parsing challenges. By working with the Enron Email Corpus, students will apply their knowledge of regular expressions to extract relevant information from a larger and messier dataset, further honing the skills learned in the previous class on data preparation and automation.

Preparation:
  1. There is no required preparation for this class; it builds on what we started in the prior class.
  2. The case is provided as a reference; it is not necessary to read it before class.

Class Plan:
Teams: During this class, please sit in your discussion teams.
  1. We will work primarily in the labs to explore the Enron Email dataset. Before class, update your ACCTG522_Labs folder using git pull.
  2. The main goal for this class is to become more comfortable with advanced ETL (parsing tools), including regular expressions (RegEx).
  3. After walking through the solution for extracting the emails, the class will be open-ended: we will search for suspicious emails and examine email sentiment.
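
For anyone who wants to see the idea behind the extraction step outside of Alteryx, the following is a minimal Python sketch of pulling header fields out of raw email text with regular expressions. It is an illustration only, not the lab solution: the sample message is invented (it is not taken from the Enron Email Corpus), and the names RAW_EMAIL, HEADER_RE, ADDRESS_RE, and parse_email are hypothetical. It also assumes each header sits on a single line, which real emails do not always follow.

```python
import re

# Invented sample in the general shape of a raw email; NOT from the corpus.
RAW_EMAIL = """\
Message-ID: <0001.example@example.com>
Date: Mon, 1 Jan 2001 09:30:00 -0800 (PST)
From: alice.sender@example.com
To: bob.reader@example.com, carol.reader@example.com
Subject: Quarterly numbers

Please review the attached figures before Friday.
"""

# Assumption: one header per line, written as "Name: value".
HEADER_RE = re.compile(r"^(From|To|Subject|Date):[ \t]*(.*)$", re.MULTILINE)

# A deliberately simple address pattern; it does not cover every valid address.
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_email(raw):
    """Split raw text into headers and body, then return one structured record."""
    # Headers end at the first blank line; everything after it is the body.
    head, _, body = raw.partition("\n\n")
    record = {name.lower(): value.strip() for name, value in HEADER_RE.findall(head)}
    # The To field can hold several recipients, so store it as a list.
    record["to"] = ADDRESS_RE.findall(record.get("to", ""))
    record["body"] = body.strip()
    return record


record = parse_email(RAW_EMAIL)
print(record["from"], record["to"], record["subject"])
```

Each message becomes one row with one column per field, which is the same unstructured-to-structured transformation the parsing tools perform in the lab. Headers that wrap across several lines would need extra handling that this sketch leaves out.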