Deep dive into my data analysis projects with detailed methodologies, code snippets, and business insights
15,500+ Records | 28 Columns
52 Countries | 2021โ2024
MySQL, Power BI, DAX, Power Query
3 Pages (Executive, Customer+Product, Return+Performance)
Samsung operates across 52 countries with thousands of sales records spanning 4 years. The challenge was to build a unified end-to-end analytics pipeline โ from raw CSV data to a 3-page interactive Power BI dashboard โ that could surface global sales trends, regional performance, product category insights, return patterns, and 5G adoption across markets.
Imported 15,500+ row CSV into MySQL using LOAD DATA INFILE with LINES TERMINATED BY '\r\n' fix for Windows encoding. Cleaned nulls, standardized columns, and created derived fields.
Ran full SQL analysis across 7 blocks: category, region, country, channel, returns, customer segments, product performance, 5G trends, and year-over-year comparisons.
Built Executive Overview, Customer+Product, and Return+Performance pages with dark theme and Samsung Blue (#1428A0) styling using DAX measures and Power Query transformations.
CREATE DATABASE samsung_sales;
USE samsung_sales;
CREATE TABLE global_sales (
order_id VARCHAR(50),
order_date DATE,
country VARCHAR(100),
region VARCHAR(50),
product_category VARCHAR(50),
product_name VARCHAR(100),
channel VARCHAR(50),
units_sold INT,
unit_price DECIMAL(10,2),
total_revenue DECIMAL(12,2),
total_cost DECIMAL(12,2),
profit DECIMAL(12,2),
returns INT,
customer_segment VARCHAR(50),
is_5g BOOLEAN
);
-- Import fix for Windows line endings
LOAD DATA INFILE 'C:/Users/satya/OneDrive/Desktop/samsung_global_sales_dataset.csv'
INTO TABLE global_sales
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;
-- Data Cleaning
UPDATE global_sales SET channel = 'Unknown' WHERE channel IS NULL;
UPDATE global_sales SET returns = 0 WHERE returns IS NULL;
-- Derived columns
ALTER TABLE global_sales ADD COLUMN year_num INT;
ALTER TABLE global_sales ADD COLUMN quarter_num INT;
UPDATE global_sales SET
year_num = YEAR(order_date),
quarter_num = QUARTER(order_date);
-- 1. Revenue by Region (2021-2024)
SELECT region,
SUM(total_revenue) AS total_revenue,
SUM(profit) AS total_profit,
ROUND(SUM(profit)/SUM(total_revenue)*100, 2) AS profit_margin
FROM global_sales
GROUP BY region
ORDER BY total_revenue DESC;
-- 2. Top 10 Countries by Revenue
SELECT country,
SUM(total_revenue) AS revenue,
SUM(units_sold) AS units,
ROUND(AVG(unit_price), 2) AS avg_price
FROM global_sales
GROUP BY country
ORDER BY revenue DESC
LIMIT 10;
-- 3. Product Category Performance
SELECT product_category,
SUM(units_sold) AS units_sold,
SUM(total_revenue) AS revenue,
SUM(returns) AS total_returns,
ROUND(SUM(returns)/SUM(units_sold)*100, 2) AS return_rate
FROM global_sales
GROUP BY product_category
ORDER BY revenue DESC;
-- 4. Year-over-Year Growth
SELECT year_num,
SUM(total_revenue) AS revenue,
SUM(units_sold) AS units,
LAG(SUM(total_revenue)) OVER (ORDER BY year_num) AS prev_year,
ROUND((SUM(total_revenue) - LAG(SUM(total_revenue)) OVER (ORDER BY year_num))
/ LAG(SUM(total_revenue)) OVER (ORDER BY year_num) * 100, 2) AS yoy_growth
FROM global_sales
GROUP BY year_num;
-- 5. 5G vs Non-5G Sales
SELECT is_5g,
COUNT(*) AS orders,
SUM(total_revenue) AS revenue,
ROUND(AVG(unit_price), 2) AS avg_price
FROM global_sales
GROUP BY is_5g;
-- 6. Channel Performance
SELECT channel,
SUM(total_revenue) AS revenue,
SUM(returns) AS returns,
ROUND(SUM(returns)/SUM(units_sold)*100, 2) AS return_rate
FROM global_sales
GROUP BY channel
ORDER BY revenue DESC;
-- 7. Return Analysis by Region
SELECT region, product_category,
SUM(returns) AS total_returns,
SUM(units_sold) AS units_sold,
ROUND(SUM(returns)/SUM(units_sold)*100, 2) AS return_rate
FROM global_sales
GROUP BY region, product_category
ORDER BY return_rate DESC;
// Total Revenue
Total Revenue = SUM(global_sales[total_revenue])
// Total Profit
Total Profit = SUM(global_sales[profit])
// Profit Margin %
Profit Margin =
DIVIDE([Total Profit], [Total Revenue], 0) * 100
// Total Units Sold
Total Units = SUM(global_sales[units_sold])
// Total Returns
Total Returns = SUM(global_sales[returns])
// Return Rate %
Return Rate =
DIVIDE([Total Returns], [Total Units], 0) * 100
// YoY Revenue Growth
YoY Growth =
VAR CurrentYear = CALCULATE([Total Revenue], YEAR(global_sales[order_date]) = MAX(YEAR(global_sales[order_date])))
VAR PrevYear = CALCULATE([Total Revenue], YEAR(global_sales[order_date]) = MAX(YEAR(global_sales[order_date])) - 1)
RETURN DIVIDE(CurrentYear - PrevYear, PrevYear, 0) * 100
// 5G Revenue Share
5G Revenue Share =
DIVIDE(
CALCULATE([Total Revenue], global_sales[is_5g] = TRUE),
[Total Revenue], 0
) * 100
// Top Country Revenue
Top Country =
CALCULATE(
[Total Revenue],
TOPN(1, ALL(global_sales[country]), [Total Revenue])
)
Asia-Pacific leads revenue across all 4 years with consistent YoY growth, driven by high smartphone penetration.
5G products grew from 18% revenue share in 2021 to 54% in 2024 โ fastest growing product category globally.
Online channel shows 2.3x higher return rate vs in-store โ suggesting product expectation gaps in digital sales.
Global revenue grew 31% from 2021 to 2024 with Q4 consistently outperforming all other quarters across regions.
55,500+ Patient Records
6 Medical Conditions
Power BI, DAX, Power Query
Bookmarks, Human Body Toggle, Dynamic Filters
Healthcare providers need quick visibility into patient risk patterns across multiple medical conditions. The goal was to analyze 55,500+ patient records and build an interactive Power BI dashboard that enables clinical and operational teams to identify high-risk patient segments, track test result distributions, and monitor billing patterns โ with a unique Human Body Toggle feature for intuitive healthy vs unhealthy patient comparison.
Loaded 55,500 row CSV into Power BI via Power Query. Handled nulls, standardized date formats, created age group buckets, and built calculated columns for risk segmentation.
Built core measures for Total Patients, Avg Billing, Avg Age, Abnormal %, and condition-specific calculations using CALCULATE and DIVIDE functions.
Implemented dynamic filter buttons (Normal/Abnormal/Inconclusive), bookmark-based Human Body Toggle, conditional formatting, and custom slicer styling for clinical usability.
// Total Patients
Total Patients = COUNTROWS('healthcare_dataset')
// Average Billing Amount
Avg Billing = AVERAGE('healthcare_dataset'[Billing Amount])
// Average Age
Avg Age = AVERAGE('healthcare_dataset'[Age])
// Abnormal Test Result %
Abnormal % =
DIVIDE(
CALCULATE(
COUNTROWS('healthcare_dataset'),
'healthcare_dataset'[Test Results] = "Abnormal"
),
COUNTROWS('healthcare_dataset'), 0
) * 100
// Normal Test Result %
Normal % =
DIVIDE(
CALCULATE(
COUNTROWS('healthcare_dataset'),
'healthcare_dataset'[Test Results] = "Normal"
),
COUNTROWS('healthcare_dataset'), 0
) * 100
// Patients by Condition
Patients by Condition =
CALCULATE(
[Total Patients],
ALLEXCEPT('healthcare_dataset', 'healthcare_dataset'[Medical Condition])
)
// High Billing Patients (>$25K)
High Billing Count =
CALCULATE(
COUNTROWS('healthcare_dataset'),
'healthcare_dataset'[Billing Amount] > 25000
)
// Gender Split
Female Patients =
CALCULATE(
[Total Patients],
'healthcare_dataset'[Gender] = "Female"
)
Male Patients =
CALCULATE(
[Total Patients],
'healthcare_dataset'[Gender] = "Male"
)
Total Patients, Avg Age, Avg Billing Amount โ with conditional color formatting based on thresholds
Normal / Abnormal / Inconclusive โ bookmark-driven buttons that filter entire dashboard instantly
Unique visual feature โ toggle between Healthy and Unhealthy body images using Power BI bookmarks
Donut (Condition Distribution), 100% Stacked Bar (Test Results by Gender), Ribbon Chart (Age Trends), Line Chart (Patient Distribution)
Diabetes and Hypertension account for the largest patient share, with Abnormal test results highest in the 50-70 age group.
Cancer patients show 2.4x higher average billing vs other conditions โ critical for resource planning and insurance risk modeling.
Abnormal test results are slightly higher among male patients (54%) across most medical conditions in the dataset.
Patients aged 60+ represent 38% of all Abnormal results despite being only 22% of total patient volume.
100,000 Transactions
$2.5M in Losses
1% Overall
MySQL, Power BI, DAX
Credit card fraud costs businesses millions annually. The objective was to analyze 100,000 transactions to identify fraud patterns, detect high-risk times and locations, understand fraud methodologies, and build an interactive dashboard with real-time fraud detection capabilities to prevent financial losses.
CREATE DATABASE fraud_detection;
USE fraud_detection;
CREATE TABLE transactions (
transaction_id VARCHAR(50) PRIMARY KEY,
card_number VARCHAR(20),
transaction_date DATE,
transaction_time TIME,
amount DECIMAL(10,2),
merchant_name VARCHAR(100),
merchant_category VARCHAR(50),
location_city VARCHAR(100),
location_state VARCHAR(50),
transaction_type VARCHAR(20),
is_fraud BOOLEAN,
INDEX idx_date (transaction_date),
INDEX idx_fraud (is_fraud),
INDEX idx_location (location_city, location_state),
INDEX idx_amount (amount)
);
LOAD DATA LOCAL INFILE '/path/raw_transactions.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
-- Remove duplicates
DELETE t1 FROM transactions t1
INNER JOIN transactions t2
WHERE t1.transaction_id > t2.transaction_id
AND t1.card_number = t2.card_number
AND t1.transaction_date = t2.transaction_date
AND t1.amount = t2.amount;
-- Standardize transaction types
UPDATE transactions
SET transaction_type =
CASE
WHEN LOWER(transaction_type) IN ('purchase', 'buy') THEN 'Purchase'
WHEN LOWER(transaction_type) IN ('refund', 'return') THEN 'Refund'
WHEN LOWER(transaction_type) IN ('withdrawal', 'atm') THEN 'Withdrawal'
ELSE 'Other'
END;
-- Add derived columns
ALTER TABLE transactions ADD COLUMN hour_of_day INT;
ALTER TABLE transactions ADD COLUMN day_of_week VARCHAR(10);
ALTER TABLE transactions ADD COLUMN fraud_flag VARCHAR(10);
UPDATE transactions SET
hour_of_day = HOUR(transaction_time),
day_of_week = DAYNAME(transaction_date),
fraud_flag = IF(is_fraud = 1, 'Fraud', 'Legitimate');
-- 1. Overall Fraud Statistics
SELECT
COUNT(*) as Total_Transactions,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as Fraud_Count,
ROUND(AVG(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) * 100, 2) as Fraud_Rate,
SUM(CASE WHEN is_fraud = 1 THEN amount ELSE 0 END) as Total_Fraud_Loss
FROM transactions;
-- 2. Fraud by Hour of Day
SELECT hour_of_day,
COUNT(*) as Total_Trans,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as Fraud_Count,
ROUND(AVG(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) * 100, 2) as Fraud_Rate
FROM transactions
GROUP BY hour_of_day
ORDER BY Fraud_Rate DESC;
-- 3. High-Risk Locations
SELECT location_city, location_state,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as Fraud_Count,
ROUND(AVG(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) * 100, 2) as Fraud_Rate,
SUM(CASE WHEN is_fraud = 1 THEN amount ELSE 0 END) as Fraud_Loss
FROM transactions
GROUP BY location_city, location_state
HAVING Fraud_Count > 10
ORDER BY Fraud_Loss DESC
LIMIT 10;
-- 4. Fraud by Transaction Type
SELECT transaction_type,
COUNT(*) as Total,
SUM(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) as Fraud_Count,
ROUND(AVG(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END) * 100, 2) as Fraud_Rate
FROM transactions
GROUP BY transaction_type
ORDER BY Fraud_Rate DESC;
-- 5. Cards with Multiple Fraud Attempts
SELECT card_number,
COUNT(*) as Fraud_Attempts,
SUM(amount) as Total_Fraud_Amount
FROM transactions
WHERE is_fraud = 1
GROUP BY card_number
HAVING COUNT(*) >= 3
ORDER BY Total_Fraud_Amount DESC;
// Total Transactions
Total Transactions = COUNTROWS(Transactions)
// Fraud Count
Fraud Count =
CALCULATE(
COUNTROWS(Transactions),
Transactions[is_fraud] = TRUE
)
// Fraud Rate %
Fraud Rate =
DIVIDE([Fraud Count], [Total Transactions], 0) * 100
// Total Fraud Loss
Total Fraud Loss =
CALCULATE(
SUM(Transactions[amount]),
Transactions[is_fraud] = TRUE
)
// Average Fraud Amount
Avg Fraud Amount =
CALCULATE(
AVERAGE(Transactions[amount]),
Transactions[is_fraud] = TRUE
)
// High Risk Hours (6 PM and 1 AM)
High Risk Trans =
CALCULATE(
COUNTROWS(Transactions),
Transactions[hour_of_day] IN {1, 18}
)
// Refund Fraud Rate
Refund Fraud Rate =
CALCULATE(
[Fraud Rate],
Transactions[transaction_type] = "Refund"
)
// Unique Cards Affected
Unique Cards Fraud =
CALCULATE(
DISTINCTCOUNT(Transactions[card_number]),
Transactions[is_fraud] = TRUE
)
1% overall fraud rate โ 1,000 fraudulent transactions out of 100,000, resulting in $2.5M potential losses.
Fraud peaks at 6 PM and 1 AM โ suggesting automated bot attacks during off-hours with less monitoring.
NYC and San Diego are top fraud locations โ 35% of all fraudulent transactions originate from these cities.
Refunds used more than purchases for fraud (60% vs 40%) โ fraudsters exploit refund policies with less scrutiny.
Open to remote contract & freelance projects โ Data Analysis, Power BI Dashboards, SQL & Business Intelligence.
๐ Remote ยท โฐ EST / PST Timezone ยท ๐ Immediate Start