Advanced Code Example — Pandas and DataFrames#

This example demonstrates DataFrame creation, filtering, derived columns, groupby aggregation, merging DataFrames, and applying custom functions — covering the core Pandas operations you’ll use throughout your analytics career.

Business Scenario#

You are analyzing customer transaction data for a regional retail chain. You need to:

Build a DataFrame from raw data
Add derived analytics columns (tier, average purchase)
Filter to specific customer segments
Aggregate revenue and metrics by region
Merge with a supplemental loyalty score dataset
Produce a final summary report

Code#

import pandas as pd

# ── Dataset ──────────────────────────────────────────────────────────
customer_data = {
    'customer_name': ['Alice Johnson', 'Bob Martinez', 'Carol Chen',
                      'David Kim', 'Eve Torres', 'Frank Li', 'Grace Park'],
    'region':        ['Northwest', 'Southwest', 'Northwest',
                      'Southeast', 'Southwest', 'Northeast', 'Northwest'],
    'total_spent':   [1257.30, 430.50, 890.75, 125.00, 1450.00, 675.25, 980.00],
    'purchase_count':[12, 4, 9, 2, 15, 7, 11],
    'is_premium':    [True, False, True, False, True, False, True],
    'months_active': [24, 6, 18, 3, 30, 12, 20]
}

df = pd.DataFrame(customer_data)

# ── Derived Columns ──────────────────────────────────────────────────
df['avg_purchase'] = (df['total_spent'] / df['purchase_count']).round(2)
df['monthly_spend'] = (df['total_spent'] / df['months_active']).round(2)

def assign_tier(total_spent: float) -> str:
    if total_spent >= 1200:  return 'Platinum'
    elif total_spent >= 700: return 'Gold'
    elif total_spent >= 300: return 'Silver'
    return 'Standard'

df['tier'] = df['total_spent'].apply(assign_tier)

# ── Filtering ────────────────────────────────────────────────────────
premium_df = df[df['is_premium'] == True].copy()
high_value_df = df[df['total_spent'] >= 800].copy()
northwest_premium = df[(df['region'] == 'Northwest') & (df['is_premium'] == True)]

# ── GroupBy Aggregation ───────────────────────────────────────────────
regional_summary = (
    df.groupby('region')
    .agg(
        total_revenue=('total_spent', 'sum'),
        avg_spent=('total_spent', 'mean'),
        customer_count=('customer_name', 'count'),
        premium_count=('is_premium', 'sum')
    )
    .round(2)
    .sort_values('total_revenue', ascending=False)
)

tier_summary = (
    df.groupby('tier')
    .agg(
        count=('customer_name', 'count'),
        total_revenue=('total_spent', 'sum'),
        avg_per_customer=('total_spent', 'mean')
    )
    .round(2)
)

# ── Merge with Supplemental Data ─────────────────────────────────────
loyalty_data = pd.DataFrame({
    'customer_name': ['Alice Johnson', 'Carol Chen', 'Eve Torres', 'Grace Park'],
    'loyalty_score': [92, 78, 95, 85],
    'churn_risk':    ['low', 'medium', 'low', 'low']
})

enriched_df = df.merge(loyalty_data, on='customer_name', how='left')
enriched_df['churn_risk'] = enriched_df['churn_risk'].fillna('unscored')

# ── Output ────────────────────────────────────────────────────────────
print("=" * 62)
print("  FULL CUSTOMER DATASET WITH DERIVED COLUMNS")
print("=" * 62)
display_cols = ['customer_name', 'region', 'tier', 'total_spent', 'avg_purchase', 'monthly_spend']
print(df[display_cols].to_string(index=False))

print("\n" + "=" * 62)
print("  REGIONAL SUMMARY")
print("=" * 62)
print(regional_summary.to_string())

print("\n" + "=" * 62)
print("  TIER DISTRIBUTION")
print("=" * 62)
print(tier_summary.to_string())

print("\n" + "=" * 62)
print("  PREMIUM + CHURN RISK (merged)")
print("=" * 62)
merge_cols = ['customer_name', 'tier', 'total_spent', 'loyalty_score', 'churn_risk']
print(enriched_df[enriched_df['is_premium'] == True][merge_cols].to_string(index=False))
print("=" * 62)

Expected Output#

==============================================================
  FULL CUSTOMER DATASET WITH DERIVED COLUMNS
==============================================================
 customer_name    region      tier  total_spent  avg_purchase  monthly_spend
 Alice Johnson Northwest  Platinum      1257.30        104.78          52.39
  Bob Martinez Southwest    Silver       430.50        107.62          71.75
    Carol Chen Northwest      Gold       890.75         98.97          49.49
     David Kim Southeast  Standard       125.00         62.50          41.67
    Eve Torres Southwest  Platinum      1450.00         96.67          48.33
      Frank Li Northeast      Gold       675.25         96.46          56.27
    Grace Park Northwest      Gold       980.00         89.09          49.00

==============================================================
  REGIONAL SUMMARY
==============================================================
           total_revenue  avg_spent  customer_count  premium_count
region
Southwest        1880.50     940.25               2            1.0
Northwest        3128.05    1042.68               3            2.0
Northeast         675.25     675.25               1            0.0
Southeast         125.00     125.00               1            0.0

==============================================================
  TIER DISTRIBUTION
==============================================================
          count  total_revenue  avg_per_customer
tier
Gold          3        2546.00            848.67
Platinum      2        2707.30           1353.65
Silver        1         430.50            430.50
Standard      1         125.00            125.00

==============================================================
  PREMIUM + CHURN RISK (merged)
==============================================================
 customer_name      tier  total_spent  loyalty_score churn_risk
 Alice Johnson  Platinum      1257.30           92.0        low
    Carol Chen      Gold       890.75           78.0     medium
    Eve Torres  Platinum      1450.00           95.0        low
    Grace Park      Gold       980.00           85.0        low
==============================================================

Key Concepts Demonstrated#

Concept	Pandas Syntax
Create DataFrame	`pd.DataFrame(data_dict)`
Derived column	`df['col'] = df['a'] / df['b']`
Apply function	`df['tier'] = df['total_spent'].apply(assign_tier)`
Single filter	`df[df['is_premium'] == True]`
Compound filter	`df[(df['region'] == 'NW') & (df['is_premium'] == True)]`
GroupBy + agg	`df.groupby('region').agg(...)`
Merge DataFrames	`df.merge(other_df, on='key', how='left')`
Fill NaN	`.fillna('unscored')`

Next: Jupyter Notebook →