Unlock Data Analysis: 150 Tips, Practical Code
CodeProgrammer: A Comprehensive Guide to 150 Essential Data Analysis Tips
Part 1: Mindset, Setup, and Data Loading
• Define Your Question First
Explanation: Before writing any code, clearly state the business or research question you are trying to answer. This guides your entire analysis.
# Example Question: "Which product category has the highest average sales?"
# This defines the goal: group by category, then calculate average sales.
• Use a Virtual Environment
Explanation: Isolates project dependencies, preventing conflicts between projects.
# In your terminal
python -m venv my_analysis_env
source my_analysis_env/bin/activate
pip install pandas matplotlib seaborn
• Use Version Control (Git)
Explanation: Tracks changes to your code and notebooks, allowing you to revert to previous versions and collaborate effectively.
# In your terminal
git init
git add my_notebook.ipynb
git commit -m "Initial data exploration"
• Document Assumptions
Explanation: Write down any assumptions you make about the data (e.g., "assuming sales are in USD," "treating missing values as zero").
# In a notebook markdown cell:
# ## Assumptions
# 1. Currency is in USD.
# 2. Missing `return_date` implies the item was not returned.
• Start a Data Dictionary
Explanation: Create a simple file or table that explains what each column in your dataset means.
# Markdown in a notebook:
# `user_id`: Unique identifier for the customer.
# `order_amt`: Total amount of the order.
# `is_first`: Boolean flag for the customer's first order.
• Load CSV Data with pd.read_csv()
Explanation: The primary function for reading comma-separated value files into a pandas DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
• Specify the Separator
Explanation: Use the sep parameter if your file is delimited by something other than a comma (e.g., a tab or semicolon).
# For a tab-separated file (TSV)
df = pd.read_csv('data.tsv', sep='\t')
• Load Excel Files with pd.read_excel()
Explanation: Used for reading data from .xls or .xlsx files. You can specify the sheet name.
# Needs the `openpyxl` library: pip install openpyxl
df = pd.read_excel('data.xlsx', sheet_name='SalesData')
• Load Data from a SQL Database
Explanation: Directly query a database and load the results into a DataFrame for analysis.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///my_database.db')
query = "SELECT * FROM sales;"
df = pd.read_sql(query, engine)
• Specify Data Types on Load
Explanation: Use the dtype parameter to specify column types during loading. This saves memory and prevents incorrect type inference.
# Force 'user_id' to be a string (object) instead of a number
df = pd.read_csv('data.csv', dtype={'user_id': str})
• Parse Dates on Load
Explanation: Use parse_dates to automatically convert one or more columns to datetime objects, which is crucial for time series analysis.
df = pd.read_csv('data.csv', parse_dates=['order_date'])
• Handle Large Files with chunksize
Explanation: Process large files in chunks instead of loading the entire file into memory at once.
total_sales = 0
for chunk in pd.read_csv('large_sales_data.csv', chunksize=10000):
    total_sales += chunk['sales_amount'].sum()
Part 2: Initial Data Inspection & Exploration
• View the First Few Rows with .head()
Explanation: Quickly see the first n rows of your DataFrame to get a feel for the data and column names.
# Shows the first 5 rows by default
print(df.head())
• View the Last Few Rows with .tail()
Explanation: Useful for checking if data was read correctly or for spotting trends at the end of a time-ordered dataset.
# Shows the last 3 rows
print(df.tail(3))
• Check the DataFrame's Shape with .shape
Explanation: Returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).
# Example output: (10000, 15) -> 10,000 rows, 15 columns
print(df.shape)
• Get a High-Level Summary with .info()
Explanation: Provides a concise summary including columns, non-null counts, data types, and memory usage.
# Essential first step to check for missing values and wrong dtypes
df.info()
• Get Descriptive Statistics with .describe()
Explanation: Generates summary statistics (count, mean, std, min, max, quartiles) for all numerical columns.
# Provides a quick overview of the distribution of numerical data
print(df.describe())
• Include Categorical Columns in .describe()
Explanation: Use include='object' to get summary statistics for non-numerical columns (count, unique, top, freq).
print(df.describe(include='object'))
• Check Data Types with .dtypes
Explanation: Returns a Series with the data type of each column.
# Useful for quickly verifying if columns were inferred correctly
print(df.dtypes)
• List Column Names with .columns
Explanation: Returns the column labels of the DataFrame.
print(df.columns)
• Count Unique Values with .nunique()
Explanation: Get the number of unique values in each column.
# Helps identify categorical columns vs. unique identifiers
print(df.nunique())
• Get Frequency Counts with .value_counts()
Explanation: For a specific column (Series), return a count of unique values, sorted in descending order.
# See the distribution of product categories
print(df['category'].value_counts())
• Normalize Frequency Counts
Explanation: Use normalize=True with .value_counts() to see the relative frequencies (percentages) of the unique values.
# See the percentage of sales from each category
print(df['category'].value_counts(normalize=True))
Part 3: Data Cleaning
#### Handling Missing Values
• Count Missing Values
Explanation: Use .isnull().sum() to get the total number of missing values (NaNs) in each column.
# A crucial step in data cleaning
print(df.isnull().sum())
• Calculate Percentage of Missing Values
Explanation: Understand the proportion of missing data to decide on a handling strategy.
missing_percentage = df.isnull().sum() / len(df) * 100
print(missing_percentage)
• Visualize Missing Data
Explanation: A heatmap can provide an intuitive overview of where missing data exists in your DataFrame.
import seaborn as sns
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
• Drop Rows with Any Missing Values
Explanation: dropna() removes rows that contain at least one missing value. Use with caution as it can lead to data loss.
df_cleaned = df.dropna()
• Drop Columns with Any Missing Values
Explanation: Use axis=1 to drop columns that have missing values.
df_cleaned = df.dropna(axis=1)
• Drop Rows with Missing Values in Specific Columns
Explanation: Use the subset parameter to only consider certain columns when looking for NaNs.
# Only drop rows if 'email' or 'user_id' is missing
df.dropna(subset=['email', 'user_id'], inplace=True)
• Fill Missing Values with a Static Value
Explanation: fillna() replaces NaN values with a specified value, like 0 or "Unknown".
# Fill missing prices with 0
df['price'] = df['price'].fillna(0)  # assign back instead of chained inplace
• Fill Missing Values with Mean/Median/Mode
Explanation: A common strategy to impute missing numerical or categorical data.
# Fill missing age with the median age
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)
• Forward Fill (ffill)
Explanation: Propagates the last valid observation forward to fill NaNs. Useful in time series data.
# Fill missing stock prices with the price from the previous day
df['stock_price'] = df['stock_price'].ffill()  # fillna(method='ffill') is deprecated in newer pandas
• Backward Fill (bfill)
Explanation: Propagates the next valid observation backward.
# Fill missing data with the next available data point
df['temperature'] = df['temperature'].bfill()
• Interpolate Missing Values
Explanation: Fills NaNs with a value estimated from the surrounding data points. Good for time series with trends.
df['sensor_reading'] = df['sensor_reading'].interpolate(method='linear')
• Create an Indicator for Missing Values
Explanation: Instead of filling, create a new boolean column that indicates whether the original value was missing. This preserves information.
df['age_is_missing'] = df['age'].isnull()
#### Correcting Data Types
• Change Column Type with .astype()
Explanation: Explicitly convert a column to a different data type.
# Convert 'user_id' from int to string (object)
df['user_id'] = df['user_id'].astype(str)
• Convert to Numeric with pd.to_numeric()
Explanation: A robust way to convert a column to a numeric type, with options to handle errors.
# If a value can't be converted, it becomes NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')
• Convert to Datetime with pd.to_datetime()
Explanation: Converts a column to datetime objects, enabling powerful time-based operations.
df['signup_date'] = pd.to_datetime(df['signup_date'])
• Extract Year/Month/Day from Datetime
Explanation: After converting to datetime, use the .dt accessor to extract components.
df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
• Extract Day of Week from Datetime
Explanation: Useful for analyzing weekly patterns.
# Monday=0, Sunday=6
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek
#### Handling Duplicates
• Check for Duplicate Rows
Explanation: .duplicated().sum() returns the number of duplicate rows in the DataFrame.
num_duplicates = df.duplicated().sum()
print(f"Found {num_duplicates} duplicate rows.")• View Duplicate Rows
Explanation: Filter the DataFrame to see the actual rows that are duplicates.
# `keep=False` shows all occurrences of duplicate rows
print(df[df.duplicated(keep=False)])
• Drop Duplicate Rows
Explanation: .drop_duplicates() removes duplicate rows, keeping the first occurrence by default.
df.drop_duplicates(inplace=True)
• Drop Duplicates Based on a Subset of Columns
Explanation: Use the subset parameter to define uniqueness based on specific columns.
# Keep only the first order for each user
df.drop_duplicates(subset=['user_id'], keep='first', inplace=True)
#### String Manipulation
• Convert Strings to Lowercase
Explanation: Use the .str accessor to apply string methods. Lowercasing is essential for consistent categorical data.
# 'USA' and 'usa' become the same category
df['country'] = df['country'].str.lower()
• Remove Leading/Trailing Whitespace
Explanation: .str.strip() cleans up whitespace, which can cause issues with joins and grouping.
df['email'] = df['email'].str.strip()
• Replace Characters in a String
Explanation: .str.replace() is used to replace a substring or character with another.
# Remove dollar signs and commas from a price column
df['price_str'] = df['price_str'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
• Check for Substring with .str.contains()
Explanation: Returns a boolean Series indicating if a substring is present. Useful for filtering.
# Find all products with 'premium' in their name
premium_products = df[df['product_name'].str.contains('premium', case=False, na=False)]
• Split a String into Columns
Explanation: .str.split() splits a string by a delimiter, and expand=True creates new columns.
# Split 'full_name' into 'first_name' and 'last_name'
# n=1 splits only on the first space, so every row yields exactly two columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
• Extract Substrings with Regex
Explanation: .str.extract() uses a regular expression with a capturing group to pull out specific patterns.
# Extract the number from strings like 'Product #123'
df['product_id'] = df['product_code'].str.extract(r'#(\d+)')
Part 4: Data Wrangling & Feature Engineering
#### Filtering and Selection
• Select a Single Column
Explanation: Use bracket notation to select a column, which returns a pandas Series.
user_emails = df['email']
• Select Multiple Columns
Explanation: Pass a list of column names inside the bracket notation to select multiple columns.
user_info = df[['user_id', 'name', 'signup_date']]
• Select Rows with Boolean Indexing
Explanation: Create a boolean condition to filter rows.
# Select all users from the USA
usa_users = df[df['country'] == 'usa']
• Combine Multiple Conditions
Explanation: Use & for AND and | for OR. Wrap each condition in parentheses.
# Active users from the USA
active_usa_users = df[(df['country'] == 'usa') & (df['is_active'] == True)]
• Select with .loc (Label-based)
Explanation: Access a group of rows and columns by labels or a boolean condition. df.loc[rows, columns]
# Select the 'name' and 'email' for users with id > 100
user_subset = df.loc[df['user_id'] > 100, ['name', 'email']]
• Select with .iloc (Integer-based)
Explanation: Access rows and columns by their integer position (index).
# Select the first 10 rows and the first 3 columns
subset = df.iloc[0:10, 0:3]
• Filter with .isin()
Explanation: Select rows where a column's value is in a given list. More efficient than multiple OR conditions.
# Select users from a list of high-priority countries
priority_countries = ['usa', 'canada', 'uk']
priority_users = df[df['country'].isin(priority_countries)]
• Filter with .between()
Explanation: Select rows where a column's value is within a specified range (inclusive).
# Select orders with an amount between $100 and $500
mid_value_orders = df[df['order_amount'].between(100, 500)]
• Filter with ~ (NOT)
Explanation: The tilde ~ operator negates a boolean condition.
# Select all users NOT from the USA
non_usa_users = df[~(df['country'] == 'usa')]
• Filter Using .query()
Explanation: Allows you to filter a DataFrame using a query string, which can be more readable.
# Same as the boolean indexing example above
active_usa_users = df.query("country == 'usa' and is_active == True")
#### Grouping and Aggregation
• Group Data with .groupby()
Explanation: Groups a DataFrame using one or more columns to prepare for aggregation.
# Group sales data by product category
grouped_by_category = df.groupby('category')
• Perform a Single Aggregation
Explanation: After grouping, apply an aggregation function like .sum(), .mean(), .count().
# Calculate total sales for each category
category_sales = df.groupby('category')['sales'].sum()
• Perform Multiple Aggregations with .agg()
Explanation: Apply several aggregation functions to one or more columns at once.
# Get total sales and average quantity per category
category_summary = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_quantity=('quantity', 'mean')
)
• Reset the GroupBy Index
Explanation: By default, the grouping columns become the index. as_index=False keeps them as regular columns.
category_sales = df.groupby('category', as_index=False)['sales'].sum()
• Create a Pivot Table
Explanation: .pivot_table() is a powerful way to reshape and summarize data, similar to Excel.
# Summarize average sales by category (rows) and year (columns)
pivot = pd.pivot_table(df, values='sales', index='category', columns='year', aggfunc='mean')
• Create a Frequency Table with .crosstab()
Explanation: Computes a cross-tabulation of two (or more) factors.
# See the count of orders across different categories and shipping methods
xtab = pd.crosstab(df['category'], df['shipping_method'])
#### Merging and Joining
• Merge Two DataFrames with pd.merge()
Explanation: Combines two DataFrames based on common columns, similar to a SQL JOIN.
users_df = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
orders_df = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 2]})
merged_df = pd.merge(users_df, orders_df, on='user_id')
• Specify Join Type (how)
Explanation: Control the join logic: inner (default), left, right, or outer.
# Get all users, even if they have no orders (left join)
all_users_orders = pd.merge(users_df, orders_df, on='user_id', how='left')
• Merge on Different Column Names
Explanation: Use left_on and right_on if the key columns have different names in the two DataFrames.
# users_df has 'id', orders_df has 'user_id'
merged_df = pd.merge(users_df, orders_df, left_on='id', right_on='user_id')
• Concatenate DataFrames Vertically
Explanation: pd.concat() stacks DataFrames on top of each other. They should share the same columns; any columns that don't match are filled with NaN.
df_jan = pd.read_csv('sales_jan.csv')
df_feb = pd.read_csv('sales_feb.csv')
total_sales_df = pd.concat([df_jan, df_feb], ignore_index=True)  # ignore_index=True rebuilds a clean 0-based index
#### Feature Engineering
• Create a New Column from Existing Ones
Explanation: Perform arithmetic or string operations on existing columns to create a new feature.
# Calculate the total price
df['total_price'] = df['quantity'] * df['unit_price']
• Apply a Custom Function with .apply()
Explanation: Apply a function along an axis of the DataFrame. Can be used row-wise or column-wise.
def categorize_price(price):
    if price > 100: return 'High'
    elif price > 50: return 'Medium'
    else: return 'Low'
df['price_category'] = df['unit_price'].apply(categorize_price)
• Use .map() for Value Replacement
Explanation: Substitute each value in a Series with another value, based on a dictionary.
df['priority_code'] = df['priority_name'].map({'High': 3, 'Medium': 2, 'Low': 1})
• Create Bins with pd.cut()
Explanation: Segment and sort data values into bins. Useful for converting continuous variables to categorical ones.
age_bins = [0, 18, 35, 60, 100]
age_labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)
• Create Dummy Variables (One-Hot Encoding)
Explanation: pd.get_dummies() converts categorical variables into a format of 0s and 1s for use in machine learning models.
category_dummies = pd.get_dummies(df['category'], prefix='cat')
df = pd.concat([df, category_dummies], axis=1)
Part 5: Visualization & Statistical Analysis
• Create a Histogram
Explanation: Visualizes the distribution of a single numerical variable.
import matplotlib.pyplot as plt
df['age'].hist(bins=20)
plt.show()
• Create a Bar Plot
Explanation: Compares a numerical value across different categories.
category_counts = df['category'].value_counts()
category_counts.plot(kind='bar')
plt.show()
• Create a Scatter Plot
Explanation: Visualizes the relationship between two numerical variables.
df.plot(kind='scatter', x='age', y='income')
plt.show()
• Create a Box Plot
Explanation: Shows the distribution of data based on a five-number summary. Excellent for comparing distributions across categories and spotting outliers.
import seaborn as sns
sns.boxplot(x='category', y='sales', data=df)
plt.show()
• Create a Heatmap for Correlation
Explanation: Visualizes the correlation matrix, showing how strongly numerical variables are related to each other.
# numeric_only=True skips non-numeric columns (needed in pandas >= 2.0)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
• Add Titles and Labels to Plots
Explanation: Always label your plots to make them understandable.
df['age'].hist()
plt.title('Distribution of User Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
• Save a Plot to a File
Explanation: Use plt.savefig() to save your visualization.
df['age'].hist()
plt.savefig('age_distribution.png', dpi=300)
• Calculate Correlation with .corr()
Explanation: Computes the pairwise correlation of columns, excluding NA/null values.
# Returns a correlation matrix
print(df.corr(numeric_only=True))
Part 6: Best Practices & Advanced Tips (to 150)
• Rename Columns Cleanly
Explanation: Use a dictionary with .rename() for clarity.
df.rename(columns={'old_name': 'new_name', 'another': 'new_another'}, inplace=True)
• Chain Your Operations
Explanation: Chain pandas methods together for concise, readable code. Wrap in parentheses for multi-line formatting.
(df.dropna()
   .assign(new_col=lambda d: d['col1'] * 2)  # the lambda sees the chained result, not the original df
   .query("new_col > 10")
)
• Use .assign() to Create New Columns
Explanation: A clean way to add one or more new columns, especially within a chain.
df = df.assign(
    col_c=df['col_a'] + df['col_b'],
    col_d=df['col_a'] - df['col_b']
)
• Sort Values with .sort_values()
Explanation: Sort a DataFrame by one or more columns.
df.sort_values(by=['country', 'age'], ascending=[True, False], inplace=True)
• Reset the Index with .reset_index()
Explanation: After filtering or sorting, the index can become non-sequential. Reset it to a clean 0-based index.
df_filtered.reset_index(drop=True, inplace=True)
• Drop Unnecessary Columns
Explanation: Use .drop() with axis=1 to remove columns you no longer need, saving memory.
df.drop(['temp_col_1', 'temp_col_2'], axis=1, inplace=True)
• Copy DataFrames Explicitly
Explanation: To avoid SettingWithCopyWarning, use .copy() when you intend to modify a slice of a DataFrame.
df_subset = df[df['country'] == 'usa'].copy()
df_subset['new_col'] = 1  # This is safe
• Change Display Options
Explanation: Configure pandas to show more rows, columns, or increase column width for better inspection in a notebook.
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
• Use nlargest() and nsmallest()
Explanation: An efficient way to get the top or bottom n rows based on a column's values.
# Get the 10 highest-spending users
top_10_spenders = df.nlargest(10, 'total_spending')
• Calculate Cumulative Sum
Explanation: .cumsum() is useful for tracking running totals, especially in time series data.
df['running_total_sales'] = df['daily_sales'].cumsum()
• Calculate Rolling Averages
Explanation: .rolling() provides rolling window calculations.
# 7-day moving average of sales
df['sales_7_day_avg'] = df['daily_sales'].rolling(window=7).mean()
• Shift Data for Comparisons
Explanation: .shift() moves data up or down, allowing you to compare a value with its previous or next value.
# Calculate day-over-day change in sales
df['sales_previous_day'] = df['daily_sales'].shift(1)
df['day_over_day_change'] = df['daily_sales'] - df['sales_previous_day']
• Find the Rank of Data
Explanation: .rank() computes numerical data ranks (1 through n) along an axis.
df['sales_rank'] = df['sales_amount'].rank(method='dense', ascending=False)
• Use style for Better Visualization in Notebooks
Explanation: The .style accessor allows for conditional formatting of your DataFrame display.
# Highlight max values in each column
df.style.highlight_max(axis=0)
• Save Processed Data
Explanation: After cleaning and wrangling, save the processed DataFrame to a new file to avoid re-running your cleaning script.
df_cleaned.to_csv('cleaned_data.csv', index=False)
• Use Parquet for Efficient Storage
Explanation: Parquet is a columnar storage format that is often much faster and more space-efficient than CSV.
# Needs `pyarrow`: pip install pyarrow
df_cleaned.to_parquet('cleaned_data.parquet')
• Functionize Your Cleaning Steps
Explanation: Wrap repetitive cleaning and preprocessing steps into functions for reusability and clarity.
def clean_data(df):
    df['email'] = df['email'].str.strip().str.lower()
    # ... more steps
    return df
df_clean = clean_data(df)
... (Continuing with more granular tips)
• Use np.where for Conditional Column Creation. A fast, vectorized alternative to .apply for simple if-else logic (see the sketch after this list).
• Check for Inconsistent Categorical Values. (e.g., 'USA', 'U.S.A.', 'United States').
• Standardize Column Names. (e.g., convert to snake_case, remove special characters).
• Use pd.to_timedelta for Time Differences.
• Handle Unix Timestamps. Convert them to readable datetimes.
• Detect Outliers Using the IQR Method. Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1 (see the sketch after this list).
• Detect Outliers Using the Z-Score Method. (Value - Mean) / Std Dev; values with an absolute z-score above roughly 3 are common candidates.
• Clip Outliers. Cap values at a certain percentile instead of removing them.
• Use a Log Transformation for Skewed Data. Helps normalize the distribution for some models.
• Analyze Unique Combinations of Columns. df.groupby(['col1', 'col2']).size().
• Melt DataFrames. pd.melt() transforms a DataFrame from wide to long format (see the sketch after this list).
• Set and Use a Meaningful Index. e.g., df.set_index('date') for time series.
• Handle Mixed-Type Columns. Investigate columns with dtype='object' that should be numeric.
• Use sample() for Large Datasets. Analyze a random sample to speed up initial exploration.
• Check for High Cardinality in Categorical Features. Columns with too many unique values may need special handling.
• Create Interaction Features. e.g., df['price_per_item'] = df['total_price'] / df['quantity'].
• Calculate Percentage Change. .pct_change() for time series.
• Use explode() for List-Like Entries. Transform each element of a list-like to a row.
• Use Pair Plots for Quick Relationship Overview. sns.pairplot(df).
• Use Violin Plots. A combination of a box plot and a kernel density estimate.
• Customize Plot Aesthetics with Seaborn. sns.set_style('whitegrid').
• Use Faceting to Create Subplots. sns.catplot(..., col='category').
• Annotate Plots for Clarity. Use plt.text() to add important notes to your visualizations.
• Use Assertions to Check Data Quality. assert df['id'].is_unique.
• Profile Your DataFrame with pandas-profiling (now published as ydata-profiling) for an automated EDA report.
• Optimize Memory Usage. Use smaller dtypes like int32 instead of int64, or the category type for strings with few unique values (see the sketch after this list).
• Use glob to Load Multiple Files. Find files by a pattern and load them in a loop (see the sketch after this list).
• Be Wary of Data Leakage. Don't use information from the future to create features for the past.
• Understand Correlation vs. Causation. A high correlation does not imply one variable causes the other.
• Handle Timezones. Use .dt.tz_localize() and .dt.tz_convert().
• De-duplicate Based on a Time Window. For event data, you may need to define what counts as a duplicate event.
• Use Lambda Functions in .apply() for simple, one-off operations.
• Understand the inplace=True Argument. It modifies the DataFrame directly and returns None. Use with care.
• Use pipe() for Clean Function Application. df.pipe(clean_data).pipe(feature_engineer).
• Choose the Right Plot for Your Data. (Bar for categorical, histogram for numerical distribution, scatter for relationships).
• Analyze Text Data with Word Counts.
• Perform a Chi-Squared Test for Categorical Variable Independence (see the sketch after this list).
• Perform a T-test to Compare the Means of Two Groups (see the sketch after this list).
• Explain Your Findings Clearly. The analysis is only useful if you can communicate it to stakeholders.
• Structure Your Notebooks with Markdown. Use headings, lists, and bold text to make it readable.
• Hide Code in Final Reports. Focus on the narrative and the visualizations.
• Always State the Source of Your Data.
• Be Skeptical of Your Own Results. Double-check your logic and look for alternative explanations.
• Use np.select for Complex Conditional Logic. A vectorized way to handle multiple if-elif-else conditions (covered in the np.where sketch after this list).
• Aggregate by Time. df.resample('M').sum() for monthly totals (requires a datetime index; see the sketch after this list).
• Use .attrs to Store Metadata. Attach a dictionary of metadata to your DataFrame.
• Handle Character Encodings on Load. pd.read_csv('data.csv', encoding='latin1').
• Seek Peer Review. Ask a colleague to review your analysis for errors or missed insights.
• Keep a Log of Your Steps. This makes your analysis reproducible.
• Tell a Story with Your Data. Your final output should be a clear narrative that answers the initial question.
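The sketches below flesh out a few of the granular tips above. They are minimal examples under stated assumptions, not drop-in solutions; column names such as order_amount, segment, and country are hypothetical stand-ins for your own data.
Sketch: conditional columns with np.where and np.select, assuming a numeric order_amount column.
import numpy as np
import pandas as pd

# Hypothetical demo frame; in practice this is your existing df
df = pd.DataFrame({'order_amount': [40, 120, 650, 80]})

# np.where: vectorized if-else
df['size'] = np.where(df['order_amount'] > 100, 'large', 'small')

# np.select: vectorized if-elif-else; conditions are checked in order, first match wins
conditions = [df['order_amount'] > 500, df['order_amount'] > 100]
choices = ['premium', 'large']
df['tier'] = np.select(conditions, choices, default='small')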
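Sketch: outlier detection with the IQR rule and clipping at percentiles, continuing with the same hypothetical order_amount column.
q1 = df['order_amount'].quantile(0.25)
q3 = df['order_amount'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the IQR fences are outlier candidates
outliers = df[(df['order_amount'] < lower) | (df['order_amount'] > upper)]

# Clip instead of dropping: cap values at the 1st and 99th percentiles
df['order_amount_clipped'] = df['order_amount'].clip(
    lower=df['order_amount'].quantile(0.01),
    upper=df['order_amount'].quantile(0.99),
)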
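Sketch: reshaping wide data to long format with pd.melt(), using a small hypothetical sales table.
# One column per month (wide) becomes one row per (user, month) pair (long)
wide = pd.DataFrame({'user_id': [1, 2], 'jan_sales': [10, 20], 'feb_sales': [30, 40]})
long_df = wide.melt(id_vars='user_id', var_name='month', value_name='sales')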
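Sketch: memory optimization by downcasting numerics and using the category dtype, assuming hypothetical quantity and country columns.
# Downcast integers to the smallest type that fits, and store low-cardinality strings as categories
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')
df['country'] = df['country'].astype('category')
print(df.memory_usage(deep=True))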
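Sketch: loading multiple files with glob, assuming monthly CSVs named sales_*.csv sit in the working directory.
import glob
import pandas as pd

# Find matching files, read each one, and stack them into a single DataFrame
files = sorted(glob.glob('sales_*.csv'))
df_all = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)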
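Sketch: a chi-squared test and a t-test with scipy.stats (pip install scipy), assuming hypothetical category, shipping_method, segment, and order_amount columns.
from scipy import stats

# Chi-squared test: are category and shipping_method independent?
xtab = pd.crosstab(df['category'], df['shipping_method'])
chi2, p_value, dof, expected = stats.chi2_contingency(xtab)

# Welch's t-test: do two segments differ in mean order amount?
group_a = df.loc[df['segment'] == 'A', 'order_amount']
group_b = df.loc[df['segment'] == 'B', 'order_amount']
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)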
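Sketch: time-based aggregation with .resample(), assuming hypothetical order_date and sales_amount columns.
# Resampling needs a datetime index
df['order_date'] = pd.to_datetime(df['order_date'])
monthly_sales = df.set_index('order_date')['sales_amount'].resample('ME').sum()  # 'ME' = month-end; older pandas uses 'M'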
#DataAnalysis #Python #Pandas #DataScience #DataCleaning #EDA #TipsAndTricks