Customer Segmentation Use Case Part 3: Data Exploration and Cleansing with Infoworks AI
Introduction
In this blog post, we’ll continue to address our Customer Segmentation use case by starting to work with our data.
In part 1, we described the Customer Segmentation use case, introduced the RFM segmentation framework, and articulated how we can use that segmentation method to provide value for our fictional organization, Nexacore. Click here to review that blog post.
In part 2, we described how to get started with Infoworks AI by connecting to data, creating a knowledge base, and creating a project. Click here to review that blog post.
In part 3, we will understand, discover, and prepare our data. First, we’ll get a general overview of the data through exploring the metadata and profiling the data. Next, we’ll do some data cleansing by deduplicating and formatting our data.
Data Exploration
To begin, we need to understand the data that is in our data warehouse. First, select your project from the home screen. If you do not yet have a project, see part 2 of our blog series (link above). I’m going to use my Nexacore project.
Metadata
On the project screen, we can start to explore our data using the list of tables to the left. Since we’re doing an analysis on order recency, frequency, and monetary value, I’ll select the orders table to look at the table schema.
Here I can see that my orders table has about 231 thousand rows. I can also see that the table contains an order date column to help me calculate order recency, a customer ID column to help with order frequency, and a total_amount column to assist in calculating my monetary metric.
Data Profiling
Now that I have a better understanding of my orders table, let’s get a better understanding of the data it contains. To do that we’re going to use the Infoworks AI chat feature to profile our data.
First I’m going to click the ‘Start conversation here…’ text input on the bottom of the chat screen.
I’m going to ask Infoworks AI to profile the data in my orders table by entering ‘Profile my orders data.’ in the text input and pressing ‘Enter’.
After a moment, the AI model will return a query. In addition to the SQL query, it will also return an explanation of the SQL query as well as additional hints in executing the query.
Note that because we gathered the table metadata as part of the data connection process (see Part 2), Infoworks generates the appropriate profiling query based on the data types of the columns in the table.
Next, I’m going to run the query by pressing the ‘Edit’ button to bring the SQL into the editor. In the editor, I can run the SQL by pressing the ‘Run’.
When I run my query, I get a profile of my data. The data profile provides interesting information about order dates, and the number of null values, but I also notice a data quality issue. My record count is different from my distinct record count indicating I have duplicate records in my data. Let’s do some data cleansing to address this issue.
Data Cleansing
Using natural language, I can request that Infoworks AI does some cleansing of my data. Since I know I have a duplicate data issue, I’ll start there.
Deduplication
In order to deduplicate my data, I’m going to first generate a query to retrieve my distinct records by entering ‘select distinct records by every column from my orders table’ in the chat window on the right half of the screen.
Once I’m happy with the query, I can create a view from the data that will be the deduplicated data set I’ll use for the rest of my analysis. To create a view, I simply type ‘create a view from the above query’ in the chat window on the right and run the results.
Formatting
I also need customer data for my analysis. In looking at my customer data, I need to do some cleansing. I need my first name and last name fields combined into a ‘last name, first name’ format. I also need the dates displayed in ‘MM/dd/yyyy’ format. Lastly, I need the postal code formatted as a 5 character string with leading zeros.
We can fix this in a single request using natural language from Infoworks AI. I’m simply going to enter the following into the chat.
‘from the customers table return the customer id, customer name formatted as ‘last_name, first_name’, birthdate formatted in MM/dd/yyyy format, and postal code formatted as a 5 character string with leading zeros’
The returned SQL correctly formats and cleanses the data.
Summary
In this blog post we used Infoworks AI to begin exploring and organizing our data. We looked at the table metadata to verify we had the columns we needed. We profiled the data and identified a duplicate row issue and we used Infoworks AI to deduplicate the data. Finally, we saw how Infoworks AI enables us to cleanse and format the data.
In Part 4: Data Analysis and Reporting. We’ll see how Infoworks AI enables data analysts to apply the RFM model to our data set and generate reporting and visualizations for their presentation to the CFO.