Unlock Data Analysis: 150 Tips, Practical Code

CodeProgrammer

A Comprehensive Guide to 150 Essential Data Analysis Tips

Part 1: Mindset, Setup, and Data Loading

Define Your Question First
Explanation: Before writing any code, clearly state the business or research question you are trying to answer. This guides your entire analysis.

# Example Question: "Which product category has the highest average sales?"
# This defines the goal: group by category, then calculate average sales.

Use a Virtual Environment
Explanation: Isolates project dependencies, preventing conflicts between projects.

# In your terminal
python -m venv my_analysis_env
source my_analysis_env/bin/activate
pip install pandas matplotlib seaborn

Use Version Control (Git)
Explanation: Tracks changes to your code and notebooks, allowing you to revert to previous versions and collaborate effectively.

# In your terminal
git init
git add my_notebook.ipynb
git commit -m "Initial data exploration"

Document Assumptions
Explanation: Write down any assumptions you make about the data (e.g., "assuming sales are in USD," "treating missing values as zero").

# In a notebook markdown cell:
# ## Assumptions
# 1. Currency is in USD.
# 2. Missing `return_date` implies the item was not returned.

Start a Data Dictionary
Explanation: Create a simple file or table that explains what each column in your dataset means.

# Markdown in a notebook:
# `user_id`: Unique identifier for the customer.
# `order_amt`: Total amount of the order.
# `is_first`: Boolean flag for the customer's first order.

Load CSV Data with pd.read_csv()
Explanation: The primary function for reading comma-separated value files into a pandas DataFrame.

import pandas as pd
df = pd.read_csv('data.csv')

Specify the Separator
Explanation: Use the sep parameter if your file is delimited by something other than a comma (e.g., a tab or semicolon).

# For a tab-separated file (TSV)
df = pd.read_csv('data.tsv', sep='\t')

Load Excel Files with pd.read_excel()
Explanation: Used for reading data from .xls or .xlsx files. You can specify the sheet name.

# Needs the `openpyxl` library: pip install openpyxl
df = pd.read_excel('data.xlsx', sheet_name='SalesData')

Load Data from a SQL Database
Explanation: Directly query a database and load the results into a DataFrame for analysis.

from sqlalchemy import create_engine
engine = create_engine('sqlite:///my_database.db')
query = "SELECT * FROM sales;"
df = pd.read_sql(query, engine)

Specify Data Types on Load
Explanation: Use the dtype parameter to specify column types during loading. This saves memory and prevents incorrect type inference.

# Force 'user_id' to be a string (object) instead of a number
df = pd.read_csv('data.csv', dtype={'user_id': str})

Parse Dates on Load
Explanation: Use parse_dates to automatically convert one or more columns to datetime objects, which is crucial for time series analysis.

df = pd.read_csv('data.csv', parse_dates=['order_date'])

Handle Large Files with chunksize
Explanation: Process large files in chunks instead of loading the entire file into memory at once.

total_sales = 0
for chunk in pd.read_csv('large_sales_data.csv', chunksize=10000):
    total_sales += chunk['sales_amount'].sum()

Part 2: Initial Data Inspection & Exploration

View the First Few Rows with .head()
Explanation: Quickly see the first n rows of your DataFrame to get a feel for the data and column names.

# Shows the first 5 rows by default
print(df.head())

View the Last Few Rows with .tail()
Explanation: Useful for checking if data was read correctly or for spotting trends at the end of a time-ordered dataset.

# Shows the last 3 rows
print(df.tail(3))

Check the DataFrame's Shape with .shape
Explanation: Returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).

# Example output: (10000, 15) -> 10,000 rows, 15 columns
print(df.shape)

Get a High-Level Summary with .info()
Explanation: Provides a concise summary including columns, non-null counts, data types, and memory usage.

# Essential first step to check for missing values and wrong dtypes
df.info()

Get Descriptive Statistics with .describe()
Explanation: Generates summary statistics (count, mean, std, min, max, quartiles) for all numerical columns.

# Provides a quick overview of the distribution of numerical data
print(df.describe())

Include Categorical Columns in .describe()
Explanation: Use include='object' to get summary statistics for non-numerical columns (count, unique, top, freq).

print(df.describe(include='object'))

Check Data Types with .dtypes
Explanation: Returns a Series with the data type of each column.

# Useful for quickly verifying if columns were inferred correctly
print(df.dtypes)

List Column Names with .columns
Explanation: Returns the column labels of the DataFrame.

print(df.columns)

Count Unique Values with .nunique()
Explanation: Get the number of unique values in each column.

# Helps identify categorical columns vs. unique identifiers
print(df.nunique())

Get Frequency Counts with .value_counts()
Explanation: For a specific column (Series), return a count of unique values, sorted in descending order.

# See the distribution of product categories
print(df['category'].value_counts())

Normalize Frequency Counts
Explanation: Use normalize=True with .value_counts() to see the relative frequencies (percentages) of the unique values.

# See the percentage of sales from each category
print(df['category'].value_counts(normalize=True))

Part 3: Data Cleaning

#### Handling Missing Values

Count Missing Values
Explanation: Use .isnull().sum() to get the total number of missing values (NaNs) in each column.

# A crucial step in data cleaning
print(df.isnull().sum())

Calculate Percentage of Missing Values
Explanation: Understand the proportion of missing data to decide on a handling strategy.

missing_percentage = df.isnull().sum() / len(df) * 100
print(missing_percentage)

Visualize Missing Data
Explanation: A heatmap can provide an intuitive overview of where missing data exists in your DataFrame.

import seaborn as sns
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')

Drop Rows with Any Missing Values
Explanation: dropna() removes rows that contain at least one missing value. Use with caution as it can lead to data loss.

df_cleaned = df.dropna()

Drop Columns with Any Missing Values
Explanation: Use axis=1 to drop columns that have missing values.

df_cleaned = df.dropna(axis=1)

Drop Rows with Missing Values in Specific Columns
Explanation: Use the subset parameter to only consider certain columns when looking for NaNs.

# Only drop rows if 'email' or 'user_id' is missing
df.dropna(subset=['email', 'user_id'], inplace=True)

Fill Missing Values with a Static Value
Explanation: fillna() replaces NaN values with a specified value, like 0 or "Unknown".

# Fill missing prices with 0
df['price'] = df['price'].fillna(0)

Fill Missing Values with Mean/Median/Mode
Explanation: A common strategy to impute missing numerical or categorical data.

# Fill missing age with the median age
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)

Forward Fill (ffill)
Explanation: Propagates the last valid observation forward to fill NaNs. Useful in time series data.

# Fill missing stock prices with the price from the previous day
df['stock_price'] = df['stock_price'].ffill()

Backward Fill (bfill)
Explanation: Propagates the next valid observation backward.

# Fill missing data with the next available data point
df['temperature'] = df['temperature'].bfill()

Interpolate Missing Values
Explanation: Fills NaNs with a value estimated from the surrounding data points. Good for time series with trends.

df['sensor_reading'] = df['sensor_reading'].interpolate(method='linear')

Create an Indicator for Missing Values
Explanation: Instead of filling, create a new boolean column that indicates whether the original value was missing. This preserves information.

df['age_is_missing'] = df['age'].isnull()

#### Correcting Data Types

Change Column Type with .astype()
Explanation: Explicitly convert a column to a different data type.

# Convert 'user_id' from int to string (object)
df['user_id'] = df['user_id'].astype(str)

Convert to Numeric with pd.to_numeric()
Explanation: A robust way to convert a column to a numeric type, with options to handle errors.

# If a value can't be converted, it becomes NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

Convert to Datetime with pd.to_datetime()
Explanation: Converts a column to datetime objects, enabling powerful time-based operations.

df['signup_date'] = pd.to_datetime(df['signup_date'])

Extract Year/Month/Day from Datetime
Explanation: After converting to datetime, use the .dt accessor to extract components.

df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month

Extract Day of Week from Datetime
Explanation: Useful for analyzing weekly patterns.

# Monday=0, Sunday=6
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek

#### Handling Duplicates

Check for Duplicate Rows
Explanation: .duplicated().sum() returns the number of duplicate rows in the DataFrame.

num_duplicates = df.duplicated().sum()
print(f"Found {num_duplicates} duplicate rows.")

View Duplicate Rows
Explanation: Filter the DataFrame to see the actual rows that are duplicates.

# `keep=False` shows all occurrences of duplicate rows
print(df[df.duplicated(keep=False)])

Drop Duplicate Rows
Explanation: .drop_duplicates() removes duplicate rows, keeping the first occurrence by default.

df.drop_duplicates(inplace=True)

Drop Duplicates Based on a Subset of Columns
Explanation: Use the subset parameter to define uniqueness based on specific columns.

# Keep only the first order for each user
df.drop_duplicates(subset=['user_id'], keep='first', inplace=True)

#### String Manipulation

Convert Strings to Lowercase
Explanation: Use the .str accessor to apply string methods. Lowercasing is essential for consistent categorical data.

# 'USA' and 'usa' become the same category
df['country'] = df['country'].str.lower()

Remove Leading/Trailing Whitespace
Explanation: .str.strip() cleans up whitespace, which can cause issues with joins and grouping.

df['email'] = df['email'].str.strip()

Replace Characters in a String
Explanation: .str.replace() is used to replace a substring or character with another.

# Remove dollar signs and commas from a price column
df['price_str'] = df['price_str'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)

Check for Substring with .str.contains()
Explanation: Returns a boolean Series indicating if a substring is present. Useful for filtering.

# Find all products with 'premium' in their name
premium_products = df[df['product_name'].str.contains('premium', case=False)]

Split a String into Columns
Explanation: .str.split() splits a string by a delimiter, and expand=True creates new columns.

# Split 'full_name' into 'first_name' and 'last_name'
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)

Extract Substrings with Regex
Explanation: .str.extract() uses a regular expression with a capturing group to pull out specific patterns.

# Extract the number from strings like 'Product #123'
df['product_id'] = df['product_code'].str.extract(r'#(\d+)')

Part 4: Data Wrangling & Feature Engineering

#### Filtering and Selection

Select a Single Column
Explanation: Use bracket notation to select a column, which returns a pandas Series.

user_emails = df['email']

Select Multiple Columns
Explanation: Pass a list of column names inside the bracket notation to select multiple columns.

user_info = df[['user_id', 'name', 'signup_date']]

Select Rows with Boolean Indexing
Explanation: Create a boolean condition to filter rows.

# Select all users from the USA
usa_users = df[df['country'] == 'usa']

Combine Multiple Conditions
Explanation: Use & for AND and | for OR. Wrap each condition in parentheses.

# Active users from the USA
active_usa_users = df[(df['country'] == 'usa') & (df['is_active'] == True)]

Select with .loc (Label-based)
Explanation: Access a group of rows and columns by labels or a boolean condition. df.loc[rows, columns]

# Select the 'name' and 'email' for users with id > 100
user_subset = df.loc[df['user_id'] > 100, ['name', 'email']]

Select with .iloc (Integer-based)
Explanation: Access rows and columns by their integer position (index).

# Select the first 10 rows and the first 3 columns
subset = df.iloc[0:10, 0:3]

Filter with .isin()
Explanation: Select rows where a column's value is in a given list. More efficient than multiple OR conditions.

# Select users from a list of high-priority countries
priority_countries = ['usa', 'canada', 'uk']
priority_users = df[df['country'].isin(priority_countries)]

Filter with .between()
Explanation: Select rows where a column's value is within a specified range (inclusive).

# Select orders with an amount between $100 and $500
mid_value_orders = df[df['order_amount'].between(100, 500)]

Filter with ~ (NOT)
Explanation: The tilde ~ operator negates a boolean condition.

# Select all users NOT from the USA
non_usa_users = df[~(df['country'] == 'usa')]

Filter Using .query()
Explanation: Allows you to filter a DataFrame using a query string, which can be more readable.

# Same as the boolean indexing example above
active_usa_users = df.query("country == 'usa' and is_active == True")

#### Grouping and Aggregation

Group Data with .groupby()
Explanation: Groups a DataFrame using one or more columns to prepare for aggregation.

# Group sales data by product category
grouped_by_category = df.groupby('category')

Perform a Single Aggregation
Explanation: After grouping, apply an aggregation function like .sum(), .mean(), .count().

# Calculate total sales for each category
category_sales = df.groupby('category')['sales'].sum()

Perform Multiple Aggregations with .agg()
Explanation: Apply several aggregation functions to one or more columns at once.

# Get total sales and average quantity per category
category_summary = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_quantity=('quantity', 'mean')
)

Reset the GroupBy Index
Explanation: By default, the grouping columns become the index. as_index=False keeps them as regular columns.

category_sales = df.groupby('category', as_index=False)['sales'].sum()

Create a Pivot Table
Explanation: .pivot_table() is a powerful way to reshape and summarize data, similar to Excel.

# Summarize average sales by category (rows) and year (columns)
pivot = pd.pivot_table(df, values='sales', index='category', columns='year', aggfunc='mean')

Create a Frequency Table with .crosstab()
Explanation: Computes a cross-tabulation of two (or more) factors.

# See the count of orders across different categories and shipping methods
xtab = pd.crosstab(df['category'], df['shipping_method'])

#### Merging and Joining

Merge Two DataFrames with pd.merge()
Explanation: Combines two DataFrames based on common columns, similar to a SQL JOIN.

users_df = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
orders_df = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 2]})

merged_df = pd.merge(users_df, orders_df, on='user_id')

Specify Join Type (how)
Explanation: Control the join logic: inner (default), left, right, or outer.

# Get all users, even if they have no orders (left join)
all_users_orders = pd.merge(users_df, orders_df, on='user_id', how='left')

Merge on Different Column Names
Explanation: Use left_on and right_on if the key columns have different names in the two DataFrames.

# users_df has 'id', orders_df has 'user_id'
merged_df = pd.merge(users_df, orders_df, left_on='id', right_on='user_id')

Concatenate DataFrames Vertically
Explanation: pd.concat() stacks DataFrames on top of each other. They must have the same columns.

df_jan = pd.read_csv('sales_jan.csv')
df_feb = pd.read_csv('sales_feb.csv')

total_sales_df = pd.concat([df_jan, df_feb])

#### Feature Engineering

Create a New Column from Existing Ones
Explanation: Perform arithmetic or string operations on existing columns to create a new feature.

# Calculate the total price
df['total_price'] = df['quantity'] * df['unit_price']

Apply a Custom Function with .apply()
Explanation: Apply a function along an axis of the DataFrame. Can be used row-wise or column-wise.

def categorize_price(price):
    if price > 100: return 'High'
    elif price > 50: return 'Medium'
    else: return 'Low'

df['price_category'] = df['unit_price'].apply(categorize_price)

Use .map() for Value Replacement
Explanation: Substitute each value in a Series with another value, based on a dictionary.

df['priority_code'] = df['priority_name'].map({'High': 3, 'Medium': 2, 'Low': 1})

Create Bins with pd.cut()
Explanation: Segment and sort data values into bins. Useful for converting continuous variables to categorical ones.

age_bins = [0, 18, 35, 60, 100]
age_labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)

Create Dummy Variables (One-Hot Encoding)
Explanation: pd.get_dummies() converts categorical variables into a format of 0s and 1s for use in machine learning models.

category_dummies = pd.get_dummies(df['category'], prefix='cat')
df = pd.concat([df, category_dummies], axis=1)

Part 5: Visualization & Statistical Analysis

Create a Histogram
Explanation: Visualizes the distribution of a single numerical variable.

import matplotlib.pyplot as plt
df['age'].hist(bins=20)
plt.show()

Create a Bar Plot
Explanation: Compares a numerical value across different categories.

category_counts = df['category'].value_counts()
category_counts.plot(kind='bar')
plt.show()

Create a Scatter Plot
Explanation: Visualizes the relationship between two numerical variables.

df.plot(kind='scatter', x='age', y='income')
plt.show()

Create a Box Plot
Explanation: Shows the distribution of data based on a five-number summary. Excellent for comparing distributions across categories and spotting outliers.

import seaborn as sns
sns.boxplot(x='category', y='sales', data=df)
plt.show()

Create a Heatmap for Correlation
Explanation: Visualizes the correlation matrix, showing how strongly numerical variables are related to each other.

correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

Add Titles and Labels to Plots
Explanation: Always label your plots to make them understandable.

df['age'].hist()
plt.title('Distribution of User Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Save a Plot to a File
Explanation: Use plt.savefig() to save your visualization.

df['age'].hist()
plt.savefig('age_distribution.png', dpi=300)

Calculate Correlation with .corr()
Explanation: Computes the pairwise correlation of columns, excluding NA/null values.

# Returns a correlation matrix
print(df.corr(numeric_only=True))

Part 6: Best Practices & Advanced Tips (to 150)

Rename Columns Cleanly
Explanation: Use a dictionary with .rename() for clarity.

df.rename(columns={'old_name': 'new_name', 'another': 'new_another'}, inplace=True)

Chain Your Operations
Explanation: Chain pandas methods together for concise, readable code. Wrap in parentheses for multi-line formatting.

result = (
    df.dropna()
      .assign(new_col=lambda d: d['col1'] * 2)
      .query("new_col > 10")
)

Use .assign() to Create New Columns
Explanation: A clean way to add one or more new columns, especially within a chain.

df = df.assign(
    col_c=df['col_a'] + df['col_b'],
    col_d=df['col_a'] - df['col_b']
)

Sort Values with .sort_values()
Explanation: Sort a DataFrame by one or more columns.

df.sort_values(by=['country', 'age'], ascending=[True, False], inplace=True)

Reset the Index with .reset_index()
Explanation: After filtering or sorting, the index can become non-sequential. Reset it to a clean 0-based index.

df_filtered.reset_index(drop=True, inplace=True)

Drop Unnecessary Columns
Explanation: Use .drop() with axis=1 to remove columns you no longer need, saving memory.

df.drop(['temp_col_1', 'temp_col_2'], axis=1, inplace=True)

Copy DataFrames Explicitly
Explanation: To avoid SettingWithCopyWarning, use .copy() when you intend to modify a slice of a DataFrame.

df_subset = df[df['country'] == 'usa'].copy()
df_subset['new_col'] = 1 # This is safe

Change Display Options
Explanation: Configure pandas to show more rows, columns, or increase column width for better inspection in a notebook.

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

Use nlargest() and nsmallest()
Explanation: An efficient way to get the top or bottom n rows based on a column's values.

# Get the 10 highest-spending users
top_10_spenders = df.nlargest(10, 'total_spending')

Calculate Cumulative Sum
Explanation: .cumsum() is useful for tracking running totals, especially in time series data.

df['running_total_sales'] = df['daily_sales'].cumsum()

Calculate Rolling Averages
Explanation: .rolling() provides rolling window calculations.

# 7-day moving average of sales
df['sales_7_day_avg'] = df['daily_sales'].rolling(window=7).mean()

Shift Data for Comparisons
Explanation: .shift() moves data up or down, allowing you to compare a value with its previous or next value.

# Calculate day-over-day change in sales
df['sales_previous_day'] = df['daily_sales'].shift(1)
df['sales_change'] = df['daily_sales'] - df['sales_previous_day']

Find the Rank of Data
Explanation: .rank() computes numerical data ranks (1 through n) along an axis.

df['sales_rank'] = df['sales_amount'].rank(method='dense', ascending=False)

Use style for Better Visualization in Notebooks
Explanation: The .style accessor allows for conditional formatting of your DataFrame display.

# Highlight max values in each column
df.style.highlight_max(axis=0)

Save Processed Data
Explanation: After cleaning and wrangling, save the processed DataFrame to a new file to avoid re-running your cleaning script.

df_cleaned.to_csv('cleaned_data.csv', index=False)

Use Parquet for Efficient Storage
Explanation: Parquet is a columnar storage format that is often much faster and more space-efficient than CSV.

# Needs `pyarrow`: pip install pyarrow
df_cleaned.to_parquet('cleaned_data.parquet')

Functionize Your Cleaning Steps
Explanation: Wrap repetitive cleaning and preprocessing steps into functions for reusability and clarity.

def clean_data(df):
    df['email'] = df['email'].str.strip().str.lower()
    # ... more steps
    return df

df_clean = clean_data(df)

... (Continuing with more granular tips)

Use np.where for Conditional Column Creation. A fast, vectorized alternative to .apply for simple if-else logic (see the conditional-logic sketch after this list).
Check for Inconsistent Categorical Values. (e.g., 'USA', 'U.S.A.', 'United States').
Standardize Column Names. (e.g., convert to snake_case, remove special characters).
Use pd.to_timedelta for Time Differences.
Handle Unix Timestamps. Convert them to readable datetimes (see the timestamps sketch after this list).
Detect Outliers Using the IQR Method. Flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1.
Detect Outliers Using the Z-Score Method. (Value - Mean) / Std Dev.
Clip Outliers. Cap values at a certain percentile instead of removing them.
Use a Log Transformation for Skewed Data. Helps normalize the distribution for some models. (These four tips are sketched together in the outlier-handling sketch after this list.)
Analyze Unique Combinations of Columns. df.groupby(['col1', 'col2']).size().
Melt DataFrames. pd.melt() transforms a DataFrame from wide to long format.
Set and Use a Meaningful Index. e.g., df.set_index('date') for time series.
Handle Mixed-Type Columns. Investigate columns with dtype='object' that should be numeric.
Use sample() for Large Datasets. Analyze a random sample to speed up initial exploration.
Check for High Cardinality in Categorical Features. Columns with too many unique values may need special handling.
Create Interaction Features. e.g., df['price_per_item'] = df['total_price'] / df['quantity'].
Calculate Percentage Change. .pct_change() for time series.
Use explode() for List-Like Entries. Transform each element of a list-like to a row.
Use Pair Plots for Quick Relationship Overview. sns.pairplot(df).
Use Violin Plots. A combination of a box plot and a kernel density estimate.
Customize Plot Aesthetics with Seaborn. sns.set_style('whitegrid').
Use Faceting to Create Subplots. sns.catplot(..., col='category').
Annotate Plots for Clarity. Use plt.text() to add important notes to your visualizations.
Use Assertions to Check Data Quality. assert df['id'].is_unique.
Profile Your DataFrame with ydata-profiling (formerly pandas-profiling) for an automated EDA report.
Optimize Memory Usage. Use smaller dtypes like int32 instead of int64, or the category type for strings with few unique values (see the memory sketch after this list).
Use glob to Load Multiple Files. Find files by a pattern and load them in a loop (see the glob sketch after this list).
Be Wary of Data Leakage. Don't use information from the future to create features for the past.
Understand Correlation vs. Causation. A high correlation does not imply one variable causes the other.
Handle Timezones. Use .dt.tz_localize() and .dt.tz_convert() (covered in the timestamps sketch after this list).
De-duplicate Based on a Time Window. For event data, you may need to define what counts as a duplicate event.
Use Lambda Functions in .apply() for simple, one-off operations.
Understand the inplace=True Argument. It modifies the DataFrame directly and returns None. Use with care.
Use pipe() for Clean Function Application. df.pipe(clean_data).pipe(feature_engineer).
Choose the Right Plot for Your Data. (Bar for categorical, histogram for numerical distribution, scatter for relationships).
Analyze Text Data with Word Counts.
Perform a Chi-Squared Test for Categorical Variable Independence.
Perform a T-test to Compare Means of Two Groups. (Both this and the chi-squared test are covered in the hypothesis-testing sketch after this list.)
Explain Your Findings Clearly. The analysis is only useful if you can communicate it to stakeholders.
Structure Your Notebooks with Markdown. Use headings, lists, and bold text to make it readable.
Hide Code in Final Reports. Focus on the narrative and the visualizations.
Always State the Source of Your Data.
Be Skeptical of Your Own Results. Double-check your logic and look for alternative explanations.
Use np.select for Complex Conditional Logic. A vectorized way to handle multiple if-elif-else conditions (covered in the conditional-logic sketch after this list).
Aggregate by Time. df.resample('M').sum() for monthly totals (requires a datetime index).
Use .attrs to Store Metadata. Attach a dictionary of metadata to your DataFrame.
Handle Character Encodings on Load. pd.read_csv('data.csv', encoding='latin1').
Seek Peer Review. Ask a colleague to review your analysis for errors or missed insights.
Keep a Log of Your Steps. This makes your analysis reproducible.
Tell a Story with Your Data. Your final output should be a clear narrative that answers the initial question.
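
Code Sketches for Selected Tips Above

Sketch: Conditional Logic with np.where and np.select
A minimal sketch for the np.where and np.select tips, using a small made-up DataFrame; the 'order_amount' column and the tier labels are illustrative assumptions, not columns from the earlier examples.

import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'order_amount': [20, 75, 150, 300]})

# np.where: vectorized if-else
df['is_large_order'] = np.where(df['order_amount'] > 100, 'yes', 'no')

# np.select: multiple if-elif-else conditions, evaluated in order
conditions = [
    df['order_amount'] > 200,
    df['order_amount'] > 100,
    df['order_amount'] > 50,
]
choices = ['Very High', 'High', 'Medium']
df['order_tier'] = np.select(conditions, choices, default='Low')
print(df)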
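
Sketch: Unix Timestamps and Timezones
A minimal sketch for the Unix-timestamp and timezone tips; the 'event_ts' column and its epoch values are made up for illustration.

import pandas as pd

df = pd.DataFrame({'event_ts': [1700000000, 1700003600]})

# Convert Unix timestamps (seconds since the epoch) to datetimes
df['event_time'] = pd.to_datetime(df['event_ts'], unit='s')

# The result is timezone-naive; localize to UTC, then convert to a local zone
df['event_time_utc'] = df['event_time'].dt.tz_localize('UTC')
df['event_time_ny'] = df['event_time_utc'].dt.tz_convert('America/New_York')
print(df)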
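
Sketch: Outlier Handling and Log Transformation
A combined sketch for the IQR, z-score, clipping, and log-transform tips, assuming a numeric 'sales_amount' column; the example values are made up.

import numpy as np
import pandas as pd

df = pd.DataFrame({'sales_amount': [12, 15, 14, 13, 400, 16, 11]})
col = df['sales_amount']

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

# Z-score method: flag values more than 3 standard deviations from the mean
z_scores = (col - col.mean()) / col.std()
z_outliers = df[z_scores.abs() > 3]

# Clip instead of removing: cap at the 1st and 99th percentiles
df['sales_clipped'] = col.clip(lower=col.quantile(0.01), upper=col.quantile(0.99))

# Log transform to reduce right skew (log1p handles zeros safely)
df['sales_log'] = np.log1p(col)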
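
Sketch: Reducing Memory Usage
A sketch for the memory-optimization tip, reusing the data.csv file from earlier; the 'quantity', 'price', and 'country' columns are assumptions about its contents.

import pandas as pd

df = pd.read_csv('data.csv')
print(df.memory_usage(deep=True).sum(), 'bytes before')

# Downcast numeric columns and convert low-cardinality strings to 'category'
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')  # hypothetical column
df['price'] = pd.to_numeric(df['price'], downcast='float')          # hypothetical column
df['country'] = df['country'].astype('category')                    # hypothetical column

print(df.memory_usage(deep=True).sum(), 'bytes after')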
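
Sketch: Loading Multiple Files with glob
A sketch for the glob tip; the 'sales_*.csv' pattern assumes monthly files named like the sales_jan.csv and sales_feb.csv used earlier.

import glob
import pandas as pd

# Find every matching file and stack them into one DataFrame
files = sorted(glob.glob('sales_*.csv'))
monthly_frames = [pd.read_csv(f) for f in files]
all_sales = pd.concat(monthly_frames, ignore_index=True)
print(f"Loaded {len(files)} files, {len(all_sales)} rows total")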
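
Sketch: Hypothesis Testing (Chi-Squared and T-test)
A sketch for the two hypothesis-testing tips using scipy (pip install scipy); the 'category', 'shipping_method', 'is_active' (boolean), and 'sales' columns are assumptions based on examples earlier in this guide.

import pandas as pd
from scipy import stats

df = pd.read_csv('data.csv')

# Chi-squared test: are 'category' and 'shipping_method' independent?
xtab = pd.crosstab(df['category'], df['shipping_method'])
chi2, p_value, dof, expected = stats.chi2_contingency(xtab)
print(f"Chi-squared p-value: {p_value:.4f}")

# T-test: do active and inactive users differ in mean sales? (Welch's t-test)
active = df.loc[df['is_active'], 'sales']
inactive = df.loc[~df['is_active'], 'sales']
t_stat, p_value = stats.ttest_ind(active, inactive, equal_var=False)
print(f"T-test p-value: {p_value:.4f}")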

#DataAnalysis #Python #Pandas #DataScience #DataCleaning #EDA #TipsAndTricks
