Chapter 3 Python / R Analyses

3.1 Pandas Practice

This chapter is a focused drill on the pandas syntax that comes up most often in data science interviews. The questions use two small, consistent tables so you can stay in the flow of practicing rather than context-switching between datasets.

The best workflow: read each prompt, write the answer from memory, then expand the solution to check. For the trickier questions, say out loud why you reached for groupby, agg, merge, or transform.

For the interactive boxes below, use Run to test your answer and Reset to restore the starter code. Each run starts from a fresh copy of the same preloaded data.

3.1.1 Data Setup

If you want to work through these exercises outside the interactive page, run the cell below to create the two DataFrames used throughout this chapter. Any Python 3.10+ environment will work, and Google Colab is a convenient zero-install option.

import pandas as pd
import numpy as np

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "customer_id": [101, 102, 101, 103, 104, 102, 101, 105],
    "category": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "amount": [50, 20, 80, 40, 35, 60, 100, np.nan],
    "order_date": pd.to_datetime([
        "2025-01-01", "2025-01-02", "2025-01-05", "2025-01-07",
        "2025-01-08", "2025-01-10", "2025-01-12", "2025-01-15"
    ])
})

customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105, 106],
    "name": ["Ana", "Ben", "Cara", "Dan", "Eva", "Finn"],
    "city": ["SF", "NY", "SF", "LA", "NY", "LA"],
    "signup_date": pd.to_datetime([
        "2024-12-01", "2024-12-05", "2024-12-08",
        "2024-12-10", "2024-12-12", "2024-12-15"
    ])
})

orders has 8 rows, one per order: order id, customer id, category (A/B/C), dollar amount (one value intentionally missing), and order date.
customers has 6 rows, one per customer: customer id, name, city, and signup date. Customer 106 (Finn) has no orders.

3.1.2 Pandas Tips

These are the patterns that trip people up most in interviews. Drill them until they feel automatic.

If you get stuck, it can help to first sketch the logic in SQL and then translate it to pandas. A good fallback is to think in terms of a CTE that filters or aggregates first, then a final select.

# Boolean filtering
df[df["col"] > 5]
df[df["col"].isin(["A", "B"])]
df[df["col"].isna()]

# Sorting
df.sort_values(["a", "b"], ascending=[True, False])

# New columns with assign (returns a copy — chainable)
df.assign(new_col=df["x"] * 2)

# Groupby aggregations
df.groupby("key")["value"].sum()
df.groupby("key").agg(total=("value", "sum"), avg=("value", "mean"))

# Merge
df.merge(other, on="id", how="left")

# Pivot table
df.pivot_table(index="x", columns="y", values="z", aggfunc="sum")

# Window-style operations
df["rank"] = df.groupby("group")["value"].rank(method="dense", ascending=False)
df["running_total"] = df.sort_values("date").groupby("id")["value"].cumsum()
df["prev"] = df.sort_values("date").groupby("id")["value"].shift(1)

SQL fallback with DuckDB

If you want to think in SQL first, you can still run SQL from Python against your pandas DataFrames with duckdb. In the example below, orders is the same pandas DataFrame you have been using in the exercises, and duckdb.sql(query).df() runs the SQL and returns the result as another pandas DataFrame:

import duckdb

query = """
WITH filtered_orders AS (
    SELECT customer_id, amount
    FROM orders
    WHERE amount > 50
)
SELECT customer_id, SUM(amount) AS total_spend
FROM filtered_orders
GROUP BY customer_id
"""

duckdb.sql(query).df()

The native pandas version of the same logic is:

filtered_orders = orders[orders["amount"] > 50]
filtered_orders.groupby("customer_id")["amount"].sum().reset_index(name="total_spend")

You can do the same thing with joins too. For example, if you want total spend by customer and city, you can join orders and customers directly in SQL:

import duckdb

query = """
SELECT
    o.customer_id,
    c.city,
    SUM(o.amount) AS total_spend
FROM orders AS o
JOIN customers AS c
    ON o.customer_id = c.customer_id
GROUP BY o.customer_id, c.city
"""

# duckdb.sql(query) returns a DuckDB relation, and .df() converts that
# result into a pandas DataFrame. This works because orders and customers
# already exist in Python as pandas DataFrames.
duckdb.sql(query).df()

The native pandas version of that join is:

merged = orders.merge(customers[["customer_id", "city"]], on="customer_id")
merged.groupby(["customer_id", "city"])["amount"].sum().reset_index(name="total_spend")

DuckDB is useful when the SQL version is clearer in your head, but a few details are worth remembering. It is a separate SQL engine, not native pandas, so you are querying your DataFrames through DuckDB rather than through pandas itself. Null handling follows SQL rules, and you still need to pay attention to column types, especially dates and missing values.
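
One concrete case, sketched with this chapter's orders table: the NaN amount arrives in DuckDB as a SQL NULL, and a WHERE comparison drops NULL rows just as a pandas boolean mask treats NaN comparisons as False. Grouping is where the engines diverge.

import duckdb

# Filtering behaves the same: the NULL/NaN amount row is dropped by both.
duckdb.sql("SELECT COUNT(*) AS n FROM orders WHERE amount > 30").df()
len(orders[orders["amount"] > 30])

# Grouping differs: SQL GROUP BY keeps NULL as its own group, while
# pandas groupby silently drops NaN keys unless you pass dropna=False.
duckdb.sql("SELECT amount, COUNT(*) AS n FROM orders GROUP BY amount").df()
orders.groupby("amount", dropna=False).size()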

3.1.3 Basic Filtering and Selection

Q1. Select only the customer_id, amount, and category columns from orders.

Solution
orders[["customer_id", "amount", "category"]]

Q2. Filter orders where amount > 50.

Solution
orders[orders["amount"] > 50]

Q3. Filter orders where category is either "A" or "C".

Solution
orders[orders["category"].isin(["A", "C"])]

Q4. Return rows where amount is missing.

Solution
orders[orders["amount"].isna()]

Q5. Filter orders placed after January 7, 2025.

Solution
orders[orders["order_date"] > "2025-01-07"]

3.1.4 Sorting and Deduplicating

Q6. Sort orders by amount descending.

Solution
orders.sort_values("amount", ascending=False)

Q7. Sort by customer_id ascending and order_date descending.

Solution
orders.sort_values(["customer_id", "order_date"], ascending=[True, False])

Q8. Get one row per customer_id, keeping only their most recent order.

Solution
# Sort descending so the most recent row is first, then drop duplicates
orders.sort_values("order_date", ascending=False).drop_duplicates(subset="customer_id")

Why drop_duplicates instead of groupby? When you need the full row (not just an aggregated value), sorting then deduplicating is cleaner. If you only need the date itself, orders.groupby("customer_id")["order_date"].max() is enough; groupby + idxmax also recovers the full rows, as shown below.
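
A sketch of the idxmax route, which recovers the same full rows by index label:

# idxmax returns, per customer, the index label of the latest order_date;
# .loc then pulls those complete rows back out of orders.
latest_idx = orders.groupby("customer_id")["order_date"].idxmax()
orders.loc[latest_idx]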

3.1.5 Creating New Columns

Q9. Create a column called amount_filled where missing amounts are replaced with 0.

Solution
# Keep the new column on orders so you can use it later.
orders["amount_filled"] = orders["amount"].fillna(0)

Q10. Create a boolean column high_value that is True for orders with amount >= 75.

Solution
# Keep the new column on orders so you can use it later.
orders["high_value"] = orders["amount"] >= 75

Q11. Create a column amount_with_tax equal to amount * 1.08.

Solution
# Keep the new column on orders so you can use it later.
orders["amount_with_tax"] = orders["amount"] * 1.08

Q12. Create a column category_label where A → "alpha", B → "beta", C → "gamma".

Solution
label_map = {"A": "alpha", "B": "beta", "C": "gamma"}

# Keep the new column on orders so you can use it later.
orders["category_label"] = orders["category"].map(label_map)

.map(dict) is the idiomatic way to recode a categorical column; any value missing from the dict becomes NaN. For more complex multi-condition logic, reach for np.select (sketched below) or pd.cut.
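
A minimal np.select sketch; the tier column and its cutoffs here are illustrative, not part of the exercises:

# np.select evaluates conditions in order and takes the first match.
# The NaN amount fails both comparisons and lands in the default bucket.
conditions = [orders["amount"] >= 75, orders["amount"] >= 40]
choices = ["high", "mid"]
orders["tier"] = np.select(conditions, choices, default="low")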

Q13. Create an amount_bucket column that groups amount into "low", "medium", and "high" using cut points 0, 50, and 90.

Solution
orders["amount_bucket"] = pd.cut(
    orders["amount"],
    bins=[0, 50, 90, float("inf")],
    labels=["low", "medium", "high"]
)

pd.cut is useful when you want interview-friendly bucket logic without manually chaining conditions. Values outside the bins or missing values remain NaN. Note that the default right=True makes intervals right-closed, so an amount of exactly 0 would fall outside (0, 50] and become NaN; pass include_lowest=True if 0 should count as "low".

3.1.6 Groupby

Q14. Find total amount by customer_id.

Solution
orders.groupby("customer_id")["amount"].sum()

Q15. Find average amount by category.

Solution
orders.groupby("category")["amount"].mean()

Q16. Find both the count of orders and total amount by customer_id in one call.

Solution
orders.groupby("customer_id").agg(
    order_count=("order_id", "count"),
    total_amount=("amount", "sum")
)

Named aggregation syntax (col, func) is the modern replacement for the older dict-based agg. Prefer it: you control the output column names directly and avoid the MultiIndex columns that the dict-of-lists form produces.
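
For contrast, a sketch of the older dict-of-lists form and the cleanup it forces:

# Dict-of-lists agg yields MultiIndex columns like ("amount", "count"),
# which then need manual flattening. Note also that counting "amount"
# skips its NaN, unlike counting "order_id" as in the solution above.
old = orders.groupby("customer_id").agg({"amount": ["count", "sum"]})
old.columns = ["order_count", "total_amount"]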

Q17. Find the max order amount for each category.

Solution
orders.groupby("category")["amount"].max()

Q18. Find which customer_id has spent the most in total.

Solution
orders.groupby("customer_id")["amount"].sum().idxmax()

idxmax() returns the index label of the maximum value — here, the customer_id. max() would return the value itself.
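
With this chapter's data, the two calls answer different questions:

totals = orders.groupby("customer_id")["amount"].sum()
totals.idxmax()  # 101, the customer_id label with the largest total
totals.max()     # 230.0, the largest total itself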

3.1.7 Merge and Join

Q19. Join orders with customers to bring in name and city.

Solution
merged = orders.merge(
    customers[["customer_id", "name", "city"]],
    on="customer_id",
    how="inner"
)

Selecting only the needed columns from the right table before merging avoids ending up with unwanted columns.

Q20. After joining orders to customers, find total sales by city.

Solution
merged = orders.merge(customers[["customer_id", "city"]], on="customer_id")
merged.groupby("city")["amount"].sum()

Q21. Join customers with orders and identify which customers have no orders.

Solution
left = customers.merge(orders, on="customer_id", how="left")
left[left["order_id"].isna()][["customer_id", "name"]]

After a left join, rows from the left table with no match will have NaN in all right-table columns. Checking order_id.isna() is the clean way to find them. If you only need an existence check, a hash/set-based alternative like customers[~customers["customer_id"].isin(orders["customer_id"])] is also a good choice and can be very efficient on large datasets.

3.1.8 Missing Data

Q22. Count missing values in each column of orders.

Solution
orders.isna().sum()

Q23. Fill missing amount values with the median amount.

Solution
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

Q24. Drop rows where amount is missing.

Solution
orders = orders.dropna(subset=["amount"])

Always prefer subset= over bare dropna() to avoid accidentally dropping rows just because an unrelated column is null.

3.1.9 Dates

Q25. Extract the day-of-month from order_date into a new column day.

Solution
orders["day"] = orders["order_date"].dt.day

The .dt accessor exposes all datetime components: .dt.year, .dt.month, .dt.day, .dt.dayofweek, .dt.hour, etc.

Q26. Filter orders placed in January 2025.

Solution
orders[
    (orders["order_date"].dt.year == 2025) &
    (orders["order_date"].dt.month == 1)
]

Q27. After merging orders with customers, find the number of days between each customer's signup_date and their order date.

Solution
merged = orders.merge(customers[["customer_id", "signup_date"]], on="customer_id")
merged["days_since_signup"] = (merged["order_date"] - merged["signup_date"]).dt.days

Subtracting two datetime columns yields a Timedelta Series. .dt.days extracts the integer number of days. Forgetting .dt.days is a common mistake.
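
A quick dtype check makes the mistake easy to spot:

gap = merged["order_date"] - merged["signup_date"]
gap.dtype    # timedelta64[ns], not a number yet
gap.dt.days  # integer days, ready for comparisons and arithmetic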

3.1.10 Reshaping

Q28. Create a pivot table showing total amount by customer_id (rows) and category (columns).

Solution
orders.pivot_table(
    index="customer_id",
    columns="category",
    values="amount",
    aggfunc="sum"
)

Q29. Create a pivot table showing mean amount by city (rows) and category (columns).

Solution
merged = orders.merge(customers[["customer_id", "city"]], on="customer_id")
merged.pivot_table(
    index="city",
    columns="category",
    values="amount",
    aggfunc="mean"
)

3.1.11 Ranking and Window-Style Operations

Q30. Within each customer_id, rank orders by amount descending (rank 1 = largest).

Solution
orders["rank"] = orders.groupby("customer_id")["amount"].rank(
    method="dense",
    ascending=False
)

method="dense" gives tied values the same rank without gaps. method="min" skips ranks; method="first" assigns unique ranks in order of appearance.

Q31. For each customer, compute a running total of amount ordered by order_date.

Solution
orders_sorted = orders.sort_values("order_date")
orders_sorted["running_total"] = orders_sorted.groupby("customer_id")["amount"].cumsum()

Sort before cumsum() — groupby preserves existing row order within each group, so the sort must happen first.

Q32. For each customer, get the previous order's amount using shift.

Solution
orders_sorted = orders.sort_values(["customer_id", "order_date"])
orders_sorted["prev_amount"] = orders_sorted.groupby("customer_id")["amount"].shift(1)

shift(1) lags by one row within each group. The first row per customer gets NaN. shift(-1) gives the next value (lead).

3.1.12 Interview Prompts

These questions are phrased the way an interviewer would ask them. They require you to chain multiple operations together.

Q33. Find each customer's first order date.

Solution
orders.groupby("customer_id")["order_date"].min()

Q34. Find customers who made more than one order.

Solution
counts = orders.groupby("customer_id")["order_id"].count()
counts[counts > 1].index.tolist()

Q35. Find categories where the average order amount is greater than 50.

Solution
avg = orders.groupby("category")["amount"].mean()
avg[avg > 50].index.tolist()

Q36. For each city, find the customer with the highest total spend.

Solution
spend = (
    orders.groupby("customer_id")["amount"]
    .sum()
    .reset_index(name="total_spend")
    .merge(customers[["customer_id", "city"]], on="customer_id")
)
spend.loc[spend.groupby("city")["total_spend"].idxmax()]

idxmax() returns the index labels of the per-group maxima. Passing those to .loc[] retrieves the full rows.

Q37. Find the percentage of total sales contributed by each category.

Solution
cat_totals = orders.groupby("category")["amount"].sum()
(cat_totals / cat_totals.sum() * 100).round(2)

Q38. Return the top 2 largest orders per customer.

Solution
orders.sort_values("amount", ascending=False).groupby("customer_id").head(2)

Sort globally first, then groupby().head(n) takes the first n rows per group — which are the largest after sorting descending.
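
An equivalent window-style version, sketched with rank, mirrors SQL's ROW_NUMBER pattern:

# method="first" breaks ties by row order, like ROW_NUMBER in SQL.
# Caveat: the NaN amount gets a NaN rank, so customer 105's only order
# is dropped here, whereas the head(2) version above keeps it.
top2 = orders[
    orders.groupby("customer_id")["amount"]
    .rank(method="first", ascending=False) <= 2
]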

Q39. Find customers whose order amounts are strictly increasing over time.

Solution
def is_strictly_increasing(s):
    vals = s.dropna().tolist()
    return all(vals[i] < vals[i + 1] for i in range(len(vals) - 1))

result = (
    orders.sort_values("order_date")
    .groupby("customer_id")["amount"]
    .apply(is_strictly_increasing)
)
result[result].index.tolist()

Customers with only one order trivially satisfy this (the all() over an empty iterator returns True). Filter with groupby().filter(lambda g: len(g) > 1) first if that matters.
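
A sketch of that variant, reusing the helper defined above:

# Drop single-order customers first, then rerun the same check.
multi = orders.groupby("customer_id").filter(lambda g: len(g) > 1)
result = (
    multi.sort_values("order_date")
    .groupby("customer_id")["amount"]
    .apply(is_strictly_increasing)
)
result[result].index.tolist()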

Q40. Find the gap in days between each order and the previous order for the same customer.

Solution
df = orders.sort_values(["customer_id", "order_date"]).copy()
df["prev_date"] = df.groupby("customer_id")["order_date"].shift(1)
df["days_gap"] = (df["order_date"] - df["prev_date"]).dt.days

Q41. Build a summary table with one row per customer: total orders, total spend, average order amount, and most recent order date.

Solution
orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    total_spend=("amount", "sum"),
    avg_order_amount=("amount", "mean"),
    most_recent_order=("order_date", "max")
).reset_index()

This is a common "profile table" pattern. Named aggregation keeps the output column names clean without a rename step.

3.2 Additional Questions

These additional interview-style questions extend the pandas drill with longer analysis prompts.

The interactive editor for these questions runs Python only. When an R approach is available, it is included below as a collapsed reference solution alongside the Python answer.

3.2.1 Active users on a messaging application

Using message-level data for a P2P messaging application, calculate what fraction of senders sent messages to at least 9 unique people on March 1, 2018.

Python solution
# You can also isolate the same rows with: daily = data[data["date"] == "2018-03-01"]
daily = data.loc[data["date"].eq("2018-03-01")]
unique_receivers = daily.groupby("sender_id")["receiver_id"].nunique()
answer = (unique_receivers >= 9).mean()
answer

R solution
library(dplyr)

df_raw <- read.csv("https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/ddi_message_app_data.csv")

df_raw %>%
  filter(date == "2018-03-01") %>%
  group_by(sender_id) %>%
  summarise(unique_sends = n_distinct(receiver_id), .groups = "drop") %>%
  summarise(answer = mean(unique_sends >= 9))

3.2.2 Time for a response on a messaging application

Write code to find the fraction of messages that receive a response from the other user within five minutes.

Python solution
# We just need a unique row id after sorting.
# Using range(len(df)) is often clearer than creating it from the index.
df = data.sort_values("timestamp").copy()
df["message_id"] = range(len(df))

candidate_replies = df.merge(
    df,
    left_on=["sender_id", "receiver_id"],
    right_on=["receiver_id", "sender_id"],
    suffixes=("_sent", "_reply")
)

responded = candidate_replies[
    (candidate_replies["timestamp_reply"] > candidate_replies["timestamp_sent"]) &
    # 300 seconds = five minutes; timestamps here are numeric unix seconds
    (candidate_replies["timestamp_reply"] - candidate_replies["timestamp_sent"] <= 300)
]

answer = responded["message_id_sent"].nunique() / len(df)
answer

R solution
library(dplyr)

df <- data %>%
  arrange(timestamp) %>%
  mutate(message_id = row_number())

candidate_replies <- df %>%
  inner_join(
    df,
    by = c("sender_id" = "receiver_id", "receiver_id" = "sender_id"),
    suffix = c("_sent", "_reply")
  )

responded <- candidate_replies %>%
  filter(
    timestamp_reply > timestamp_sent,
    timestamp_reply - timestamp_sent <= 300
  )

n_distinct(responded$message_id_sent) / nrow(df)

3.2.3 Cleaning and analyzing employee data

Some date_joined values are null and should be treated as December 1, 2009. Duplicate employee records also exist. Find the number of employees that joined each month.

Python solution
clean = employees.copy()
clean["date_joined"] = pd.to_datetime(clean["date_joined"]).fillna(pd.Timestamp("2009-12-01"))
clean = clean.drop_duplicates(subset=["employee_name", "age", "yrs_of_experience"])
clean["yr_month"] = clean["date_joined"].dt.to_period("M").dt.to_timestamp()

# Equivalent chained version:
# clean = (
#     employees.assign(
#         date_joined=lambda df: pd.to_datetime(df["date_joined"]).fillna(pd.Timestamp("2009-12-01"))
#     )
#     .drop_duplicates(subset=["employee_name", "age", "yrs_of_experience"])
#     .assign(yr_month=lambda df: df["date_joined"].dt.to_period("M").dt.to_timestamp())
# )

clean.groupby("yr_month").size().reset_index(name="n").sort_values("yr_month")

R solution
library(dplyr)
library(tidyr)
library(lubridate)

employees %>%
  mutate(date_joined = as.Date(date_joined)) %>%
  replace_na(list(date_joined = as.Date("2009-12-01"))) %>%
  distinct(employee_name, age, yrs_of_experience, .keep_all = TRUE) %>%
  mutate(yr_month = floor_date(date_joined, "month")) %>%
  count(yr_month, name = "n") %>%
  arrange(yr_month)

3.2.4 Analyzing employee data

Pull the average, median, minimum, maximum, and standard deviation for compensation across five-year experience buckets.

Python solution
bins = [0, 5, 10, 15, 20, float("inf")]
labels = ["0-5", "5-10", "10-15", "15-20", "Over 20"]

summary = (
    employees_comp.assign(
        exp_bucket=pd.cut(
            employees_comp["yrs_of_experience"],
            bins=bins,
            labels=labels,
            right=False
        )
    )
    # observed=True keeps only buckets that actually appear in the data.
    # Without it, pandas can include unused categorical buckets as empty groups.
    .groupby("exp_bucket", observed=True)["compensation"]
    .agg(average="mean", median="median", min="min", max="max", std="std")
    .reset_index()
)

summary

R solution
library(dplyr)

employees_comp %>%
  mutate(
    exp_bucket = case_when(
      yrs_of_experience < 5 ~ "0-5",
      yrs_of_experience < 10 ~ "5-10",
      yrs_of_experience < 15 ~ "10-15",
      yrs_of_experience < 20 ~ "15-20",
      TRUE ~ "Over 20"
    )
  ) %>%
  group_by(exp_bucket) %>%
  summarise(
    average = mean(compensation),
    median = median(compensation),
    min = min(compensation),
    max = max(compensation),
    sd = sd(compensation),
    .groups = "drop"
  )