Advanced Code Example — Pandas and DataFrames#
This example demonstrates DataFrame creation, filtering, derived columns, groupby aggregation, merging DataFrames, and applying custom functions — covering the core Pandas operations you’ll use throughout your analytics career.
Business Scenario#
You are analyzing customer transaction data for a regional retail chain. You need to:
- Build a DataFrame from raw data
- Add derived analytics columns (tier, average purchase)
- Filter to specific customer segments
- Aggregate revenue and metrics by region
- Merge with a supplemental loyalty score dataset
- Produce a final summary report
Code#
import pandas as pd
# ── Dataset ──────────────────────────────────────────────────────────
customer_data = {
'customer_name': ['Alice Johnson', 'Bob Martinez', 'Carol Chen',
'David Kim', 'Eve Torres', 'Frank Li', 'Grace Park'],
'region': ['Northwest', 'Southwest', 'Northwest',
'Southeast', 'Southwest', 'Northeast', 'Northwest'],
'total_spent': [1257.30, 430.50, 890.75, 125.00, 1450.00, 675.25, 980.00],
'purchase_count':[12, 4, 9, 2, 15, 7, 11],
'is_premium': [True, False, True, False, True, False, True],
'months_active': [24, 6, 18, 3, 30, 12, 20]
}
df = pd.DataFrame(customer_data)
# ── Derived Columns ──────────────────────────────────────────────────
df['avg_purchase'] = (df['total_spent'] / df['purchase_count']).round(2)
df['monthly_spend'] = (df['total_spent'] / df['months_active']).round(2)
def assign_tier(total_spent: float) -> str:
if total_spent >= 1200: return 'Platinum'
elif total_spent >= 700: return 'Gold'
elif total_spent >= 300: return 'Silver'
return 'Standard'
df['tier'] = df['total_spent'].apply(assign_tier)
# ── Filtering ────────────────────────────────────────────────────────
premium_df = df[df['is_premium'] == True].copy()
high_value_df = df[df['total_spent'] >= 800].copy()
northwest_premium = df[(df['region'] == 'Northwest') & (df['is_premium'] == True)]
# ── GroupBy Aggregation ───────────────────────────────────────────────
regional_summary = (
df.groupby('region')
.agg(
total_revenue=('total_spent', 'sum'),
avg_spent=('total_spent', 'mean'),
customer_count=('customer_name', 'count'),
premium_count=('is_premium', 'sum')
)
.round(2)
.sort_values('total_revenue', ascending=False)
)
tier_summary = (
df.groupby('tier')
.agg(
count=('customer_name', 'count'),
total_revenue=('total_spent', 'sum'),
avg_per_customer=('total_spent', 'mean')
)
.round(2)
)
# ── Merge with Supplemental Data ─────────────────────────────────────
loyalty_data = pd.DataFrame({
'customer_name': ['Alice Johnson', 'Carol Chen', 'Eve Torres', 'Grace Park'],
'loyalty_score': [92, 78, 95, 85],
'churn_risk': ['low', 'medium', 'low', 'low']
})
enriched_df = df.merge(loyalty_data, on='customer_name', how='left')
enriched_df['churn_risk'] = enriched_df['churn_risk'].fillna('unscored')
# ── Output ────────────────────────────────────────────────────────────
print("=" * 62)
print(" FULL CUSTOMER DATASET WITH DERIVED COLUMNS")
print("=" * 62)
display_cols = ['customer_name', 'region', 'tier', 'total_spent', 'avg_purchase', 'monthly_spend']
print(df[display_cols].to_string(index=False))
print("\n" + "=" * 62)
print(" REGIONAL SUMMARY")
print("=" * 62)
print(regional_summary.to_string())
print("\n" + "=" * 62)
print(" TIER DISTRIBUTION")
print("=" * 62)
print(tier_summary.to_string())
print("\n" + "=" * 62)
print(" PREMIUM + CHURN RISK (merged)")
print("=" * 62)
merge_cols = ['customer_name', 'tier', 'total_spent', 'loyalty_score', 'churn_risk']
print(enriched_df[enriched_df['is_premium'] == True][merge_cols].to_string(index=False))
print("=" * 62)Expected Output#
==============================================================
FULL CUSTOMER DATASET WITH DERIVED COLUMNS
==============================================================
customer_name region tier total_spent avg_purchase monthly_spend
Alice Johnson Northwest Platinum 1257.30 104.78 52.39
Bob Martinez Southwest Silver 430.50 107.62 71.75
Carol Chen Northwest Gold 890.75 98.97 49.49
David Kim Southeast Standard 125.00 62.50 41.67
Eve Torres Southwest Platinum 1450.00 96.67 48.33
Frank Li Northeast Gold 675.25 96.46 56.27
Grace Park Northwest Gold 980.00 89.09 49.00
==============================================================
REGIONAL SUMMARY
==============================================================
total_revenue avg_spent customer_count premium_count
region
Southwest 1880.50 940.25 2 1.0
Northwest 3128.05 1042.68 3 2.0
Northeast 675.25 675.25 1 0.0
Southeast 125.00 125.00 1 0.0
==============================================================
TIER DISTRIBUTION
==============================================================
count total_revenue avg_per_customer
tier
Gold 3 2546.00 848.67
Platinum 2 2707.30 1353.65
Silver 1 430.50 430.50
Standard 1 125.00 125.00
==============================================================
PREMIUM + CHURN RISK (merged)
==============================================================
customer_name tier total_spent loyalty_score churn_risk
Alice Johnson Platinum 1257.30 92.0 low
Carol Chen Gold 890.75 78.0 medium
Eve Torres Platinum 1450.00 95.0 low
Grace Park Gold 980.00 85.0 low
==============================================================Key Concepts Demonstrated#
| Concept | Pandas Syntax |
|---|---|
| Create DataFrame | pd.DataFrame(data_dict) |
| Derived column | df['col'] = df['a'] / df['b'] |
| Apply function | df['tier'] = df['total_spent'].apply(assign_tier) |
| Single filter | df[df['is_premium'] == True] |
| Compound filter | df[(df['region'] == 'NW') & (df['is_premium'] == True)] |
| GroupBy + agg | df.groupby('region').agg(...) |
| Merge DataFrames | df.merge(other_df, on='key', how='left') |
| Fill NaN | .fillna('unscored') |
Next: Jupyter Notebook →