Welcome to the world of PandasAI, where data manipulation becomes a breeze! If you’ve ever found yourself struggling to work with large datasets or wished for an efficient way to analyse and process data, PandasAI is here to save the day.
In the realm of data analysis and artificial intelligence, the name “PandasAI” has been making waves. PandasAI is a framework that combines the Pandas data analysis library with generative AI models.
As data-driven decision-making becomes increasingly critical in today’s fast-paced world, PandasAI’s integration of data analysis and AI will play a pivotal role in unlocking the full potential of data. From business intelligence and academic research to healthcare and finance, PandasAI has the potential to transform how organizations approach data analysis and AI integration.
In this blog, we will delve into the fascinating world of PandasAI and discover how to harness its powerful capabilities for efficient data manipulation and insightful analysis. Let’s explore the various functionalities of PandasAI and learn how to make the most of this game-changing tool. Get ready to unlock new possibilities in data-driven excellence with PandasAI!
What is PandasAI?
PandasAI is a powerful open-source library in Python designed to handle data with ease. It is like having a helpful companion that makes working with data a piece of cake. This library is designed to enhance the functionality of the popular Pandas library, which is widely used for data analysis and manipulation. It utilizes generative AI models to provide additional powerful features and capabilities to pandas, making it even more efficient and effective in handling data tasks.
Why do we need PandasAI?
Imagine you are a data analyst for a large multinational corporation that deals with massive amounts of data daily. Your company collects data from various sources such as customer transactions, website interactions, social media, and IoT devices. This data is used to make business decisions, such as identifying new market opportunities, improving customer service, and optimizing marketing campaigns.
Traditionally, data analysis has been a manual process. Data analysts would have to spend hours cleaning, formatting, and manipulating data in order to extract insights. This could be a time-consuming and error-prone process.
This is where PandasAI takes over. It can automate many of the tasks involved in data analysis. PandasAI uses natural language processing and machine learning to understand data and generate insights. This allows data analysts to focus on interpreting the results and making business decisions.
How does it work?
PandasAI aims to enable interactive communication with machines, allowing you to obtain desired results without the need for manual programming. To achieve this, the framework uses the OpenAI GPT API to automatically generate Python code with the Pandas library, which is then executed in the background to deliver the requested outcomes. This approach makes for a more intuitive and user-friendly experience: you interact with the machine through natural language prompts instead of writing code from scratch.
PandasAI works by extending the popular Python data manipulation library Pandas with a generative AI layer. Given a suitable prompt, it can generate and run the code for preprocessing tasks such as handling missing values and detecting outliers, which streamlines data cleaning. In the same way, it can be prompted to derive new features from existing columns, which may improve the predictive power of machine learning models.
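The prompt-to-code loop described above can be illustrated with a toy sketch. This is not PandasAI’s actual implementation: `fake_llm` and `ask` are hypothetical names, and the language model is replaced by a stub that returns one fixed line of Pandas code, so the example runs without an API key.

```python
import pandas as pd


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a language model call.

    A real model would write this code from the prompt and the DataFrame's
    column names; here the answer is fixed so the sketch runs offline.
    """
    return "result = df['tip'].mean()"


def ask(df: pd.DataFrame, prompt: str):
    """Get code for the prompt, run it against df, and return its result."""
    code = fake_llm(prompt)
    scope = {"df": df}
    # Running generated code is risky; real tools should restrict what it can do.
    exec(code, {}, scope)
    return scope["result"]


# Three made-up rows standing in for a real dataset.
df = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0], "tip": [1.0, 3.0, 5.0]})
answer = ask(df, "What is the average tip?")
print(answer)
```

The point of the sketch is the division of labour: the model only produces code as text, and the result comes from ordinary Pandas running on your data.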
Simple Tutorial to get started with PandasAI
1. Preparing for PandasAI
Before proceeding with PandasAI, make sure your computer has all the required packages installed. If any of them are missing, install them with the following commands before moving forward.
pip install numpy
pip install scipy
pip install matplotlib
pip install pandas
pip install openai
2. Get started by installing the PandasAI library on your system.
To begin your PandasAI journey, start by installing the PandasAI library on your computer. Just follow a simple installation process shown below, and you’ll gain access to all the fantastic features PandasAI has to offer.
pip install pandasai
3. Import the PandasAI library into your Python environment.
After the successful installation, import the PandasAI library into your Python script or Jupyter Notebook. The following imports are needed:
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
4. Load a DataFrame into your Python script or notebook to work with your data.
Now, it’s time to bring your data into a DataFrame. In this example, we use “tips”, one of Seaborn’s example datasets (Seaborn must be installed, and load_dataset downloads the data on first use). However, feel free to load any dataset of your choice for analysis.
Input:
import seaborn as sb
df = sb.load_dataset('tips')
print(df.head())
Output:
5. Generate an API key to gain access to the full functionality of PandasAI.
To unlock the AI-driven capabilities of PandasAI, you’ll need to generate a unique API key. If you haven’t generated an API key yet, follow the steps given below.
- Visit OpenAI’s Platform website and log in using your OpenAI account credentials: https://platform.openai.com/overview
- Once logged in, click on your profile icon located at the top-right corner of the page and choose “View API Keys” from the dropdown menu.
- Click on “Create New Secret Key” to generate a fresh API key.
6. Initialize a PandasAI instance
With your DataFrame and API key ready, you can now initialize PandasAI in your code. This integration allows you to effortlessly combine AI-powered analysis with PandasAI’s data manipulation capabilities.
key = "YOUR_OPENAI_API_KEY"  # placeholder: replace with the secret key generated in step 5
llm = OpenAI(api_token=key)
pandas_ai = PandasAI(llm)
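A secret key pasted directly into a script or notebook is easy to leak when the file is shared. One common alternative, sketched here, is to read it from an environment variable. The helper name `load_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is assumed as a convention rather than something PandasAI requires.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or raise if unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (left commented out so the sketch runs without a real key):
# key = load_api_key()
# llm = OpenAI(api_token=key)
# pandas_ai = PandasAI(llm)
```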
7. Use various prompts to generate the outcome using PandasAI
Once PandasAI is initialized, you can use a variety of prompts to interact with your data and generate outcomes. These prompts are simple commands or questions that describe tasks like data filtering, sorting, grouping, and AI-driven analysis. By writing such prompts, you can perform complex data operations with ease.
Here are several instances that demonstrate the practical applications of PandasAI.
1. Generate simple, straightforward prompts to find basic information about your dataset.
PandasAI allows you to interact with your dataset using natural language prompts. With simple prompts like “Show me the first few rows of the dataset” (head) or “Display a summary of the dataset’s information” (info), you can quickly gain insights into your data. Similarly, prompts like “Show the last rows of the dataset” (tail) let you view the end of the data and round out your understanding of the dataset’s structure. A few examples are shown below.
Prompt 1: To show the first rows of the data, as a substitute for the head() method of a DataFrame.
Input:
# To show only the first 5 rows of the DataFrame
response = pandas_ai(df, "Show the first 5 rows of data in tabular form")
print(response)
Output:
Prompt 2: To show the description of the data, as a substitute for the describe() method of a DataFrame.
Input:
# To show the description of the data
response = pandas_ai(df, "Show the description of data in tabular form")
print(response)
Output:
Prompt 3: To find the number of rows and columns in the data, as a substitute for the shape attribute of a DataFrame.
Input:
# To show the shape of the data
response = pandas_ai(df, "What is the shape of data?")
print(response)
Output:
244 7
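For comparison, here is a minimal sketch of the plain Pandas calls these three prompts stand in for. The small DataFrame is made up for illustration, so the example runs without downloading the tips dataset.

```python
import pandas as pd

# A few made-up rows standing in for a real dataset.
df = pd.DataFrame({
    "total_bill": [12.0, 25.5, 18.0, 40.0, 9.5, 30.0],
    "tip": [2.0, 4.0, 3.0, 8.0, 1.5, 5.0],
    "size": [2, 3, 2, 4, 1, 3],
})

first_rows = df.head(5)      # first five rows
summary = df.describe()      # count, mean, std, min, quartiles, max
n_rows, n_cols = df.shape    # shape is an attribute, not a method

print(first_rows)
print(summary)
print(n_rows, n_cols)
```

Running the equivalent call yourself is a quick way to check that a natural language answer matches what the data actually contains.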
2. PandasAI for Data Preprocessing
Data preprocessing is a crucial step in the data analysis pipeline, and PandasAI simplifies this process using advanced AI-driven techniques. You can utilize prompts such as “Handle missing values” or “Apply feature scaling” to automate data cleaning and transformation. With PandasAI, complex tasks like outlier detection, data imputation, and feature engineering become straightforward as users interact with the AI to perform these operations seamlessly.
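Outlier handling is a good example of what such a prompt has to translate into. A common choice is the interquartile range (IQR) rule, sketched below in plain Pandas. The function name `clip_iqr_outliers`, the multiplier of 1.5, and the decision to clip rather than drop outliers are all assumptions made for illustration; the code PandasAI generates may differ.

```python
import pandas as pd


def clip_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the first or third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)


# Made-up bills with one obvious outlier.
bills = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0], name="total_bill")
cleaned = clip_iqr_outliers(bills)
print(cleaned)
```

Clipping keeps every row and only caps extreme values. Dropping the rows instead is equally common, which is one reason to state the desired handling explicitly in the prompt.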
Prompt 4:
Input:
response = pandas_ai(df, "Detect and handle outliers in the 'total_bill' column "
                         "using the IQR method.")
print(response)
Output:
3. Create sophisticated prompts that enable in-depth analysis and lead to more insightful outcomes.
PandasAI offers users the ability to generate advanced prompts for more comprehensive data analysis. Users can create prompts like “Calculate the average” or “Group columns to observe patterns.” These sophisticated prompts allow you to delve deeper into the dataset, perform complex aggregations, and uncover valuable insights that drive data-driven decision-making.
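To make “calculate the average” and “group columns” concrete, here is a small sketch of the kind of Pandas code such prompts correspond to. The rows are made up, and the column names follow the tips dataset used in this tutorial.

```python
import pandas as pd

# Made-up rows using the tips dataset's column names.
df = pd.DataFrame({
    "total_bill": [20.0, 30.0, 24.0, 18.0, 50.0, 42.0],
    "tip": [3.0, 4.5, 3.0, 2.0, 9.0, 6.0],
    "size": [2, 3, 3, 2, 4, 4],
})

# Row with the highest tip.
top_tip_row = df.loc[df["tip"].idxmax()]

# Average bill and tip for each party size.
by_party = df.groupby("size")[["total_bill", "tip"]].mean()

print(top_tip_row)
print(by_party)
```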
Prompt 5:
Input:
response = pandas_ai(df, "I am looking for the data point that has the highest tip")
print(response)
Output:
Prompt 6:
Input:
response = pandas_ai(df, "Investigate whether larger parties tend to leave higher "
                         "tips or spend more on average.")
print(response)
Output:
Prompt 7:
Input:
response = pandas_ai(df, '''Conduct a comprehensive analysis by aggregating
the data based on various combinations of the provided columns.''')
print(response)
Output:
4. Visualising the data is also possible using PandasAI
In addition to data analysis and preprocessing, PandasAI supports data visualisation. You can write prompts like “Generate a histogram” or “Create a scatter plot to explore the relationships”. With PandasAI’s integrated visualisation capabilities, you can efficiently generate plots and charts to gain a better understanding of your data and communicate findings effectively.
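Every chart is a drawing of some computed table, so it helps to know what that table looks like. The sketch below computes the numbers behind a pie chart and a histogram without drawing them. The rows are made up, and the bin edges are picked by hand for this toy data.

```python
import numpy as np
import pandas as pd

# Made-up rows using the tips dataset's column names.
df = pd.DataFrame({
    "total_bill": [20.0, 40.0, 25.0, 50.0, 10.0],
    "tip": [3.0, 4.0, 5.0, 5.0, 2.0],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
})

# Data behind a pie chart: how many records fall in each smoker category.
smoker_counts = df["smoker"].value_counts()

# Data behind a histogram: tip as a percentage of the bill, counted per bin.
tip_pct = df["tip"] / df["total_bill"] * 100
bin_counts, bin_edges = np.histogram(tip_pct, bins=[0, 12, 18, 30])

print(smoker_counts)
print(bin_counts)

# With Matplotlib installed, smoker_counts.plot.pie() and tip_pct.plot.hist()
# would draw the corresponding charts.
```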
Prompt 8: We can display a Pie chart using PandasAI
Input:
response = pandas_ai(df, "Plot a pie chart of the column smoker")
print(response)
Output:
Prompt 9: We can display a Histogram using PandasAI
Input:
response = pandas_ai(df, '''Calculate the tip percentage (tip / total_bill) for each
record and create a histogram to visualize its distribution.''')
print(response)
Output:
Prompt 10: We can display a scatter plot using PandasAI
Input:
response = pandas_ai(df, "Generate a scatter plot to explore the relationship "
                         "between total_bill and tip amounts.")
print(response)
Output:
After utilizing PandasAI for the specific data analysis and manipulation tasks you require, you can derive meaningful conclusions from the processed data. Remember to interpret the results critically, consider the context, and draw actionable conclusions to maximize the benefits of using PandasAI in your data-driven projects.
NOTE:
- For PandasAI to interpret prompts accurately, they must be precise, well-defined, and unambiguous. Clear, well-structured prompts improve the AI’s ability to perform the desired data transformations and help you get accurate, reliable results in your analysis and decision-making.
- Maintaining the latest versions of essential packages such as NumPy, SciPy, Matplotlib, Pandas, etc., is of utmost importance while utilizing PandasAI.
Interesting Use Case on PandasAI
Okay, let’s explore a practical application of PandasAI in a real-world scenario.
For the purpose of this example, I will be using sales data obtained from Kaggle. I have exclusively analyzed this data using PandasAI. Feel free to explore it independently.
Steps of execution:
- Importing Necessary Packages
- Loading Dataset
- Instantiating PandasAI
- Understanding the Dataset
- Data Pre-Processing
- Performing EDA
Step 1: Importing Necessary Packages
In this initial stage, we have imported all the essential libraries and packages needed for the comprehensive analysis. Notably, we have included PandasAI, which will be a primary tool for conducting various data manipulations and exploratory tasks throughout this overview.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
Step 2: Loading Dataset
As with any data analysis process, loading the dataset is a critical initial step. In this analysis, we will be working with real-world sales data sourced from Kaggle. The dataset contains information about sales transactions and will serve as the foundation for the data exploration and insights that follow.
df = pd.read_csv('sales.csv')
Step 3: Instantiating PandasAI
As mentioned in our tutorial, you need an OpenAI API key to use PandasAI. This key enables access to PandasAI’s capabilities for data analysis and manipulation.
# key holds your OpenAI API key (see the tutorial above)
llm = OpenAI(api_token=key)
pandas_ai = PandasAI(llm)
Step 4: Understanding the dataset using PandasAI
At a glance, this is a retail dataset. It covers sales transactions and related information for a global superstore over a period of four years, which makes it a valuable resource for analysing the store’s performance, customer behaviour, product trends, and geographical reach. Let’s look into the dataset step by step using PandasAI.
Input:
pandas_ai(df, "Give the information of the data in tabular form")
Output:
There are 17 columns and 9800 rows. Let’s look into it further.
Input:
pandas_ai(df, "Show the first 5 rows of data in tabular form")
Output:
Next, let’s check whether there are any missing values in our dataset.
Input:
pandas_ai(df, "Display the null values in a graph")
Output:
Now that we have familiarized ourselves with the dataset containing sales details of a store across various cities, countries, and regions over different time periods, the next step is to perform data cleaning. Data cleaning is essential to ensure the analysis proceeds smoothly and accurately. By cleaning the data, we will address any issues such as missing values, duplicates, and inconsistent data formats, thereby preparing the dataset for further exploration and insights generation.
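The checks mentioned here (missing values and duplicates) are short in plain Pandas, which makes them useful for confirming what a prompt reports. A minimal sketch on made-up rows, with illustrative column names:

```python
import pandas as pd

# Made-up rows; the column names are illustrative.
df = pd.DataFrame({
    "City": ["New York", "Seattle", None, "Seattle", "Boston"],
    "Sales": [120.0, 80.0, 75.5, 80.0, None],
})

nulls_per_column = df.isna().sum()            # missing values in each column
duplicate_rows = int(df.duplicated().sum())   # rows that repeat an earlier row

# One possible cleaning policy: drop repeated rows, then rows with gaps.
cleaned = df.drop_duplicates().dropna()

print(nulls_per_column)
print(duplicate_rows)
print(cleaned)
```

Whether to drop or fill missing values depends on the analysis. Dropping is shown only because it is the simplest policy to demonstrate.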
Step 5: Data Pre-Processing
Data preprocessing is a crucial step in the data analysis pipeline. Our data has a few missing values and a few columns that are unnecessary for this analysis, so we will remove them. Let’s see how this can be done using PandasAI.
Step 5.1: Removing columns that we don’t need
Input:
pandas_ai(df, "Delete columns RowID, Customer ID, OrderID, Postal Code")
Given just a prompt, PandasAI generates and runs the code that makes these changes to our current DataFrame.
Step 5.2: Converting Order Date and separating it into year, month, day
Input:
pandas_ai(df, "Convert column Order Date to DatetimeIndex")
pandas_ai(df, "For the column Order Date, make three columns separating year, month, day")
pandas_ai(df, "Give the information of the data in tabular form")
Output:
As evident, the changes were effortlessly implemented with the assistance of PandasAI. After completing the pre-processing stage, the focus shifts to exploratory data analysis, enabling us to extract valuable insights from the data.
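For reference, the two pre-processing steps can also be written directly in Pandas. The sketch below uses two made-up rows. The column names follow the prompts in this walkthrough, and the day/month/year date format is an assumption that should be checked against the actual file.

```python
import io

import pandas as pd

# Two made-up rows; the day/month/year date format is an assumption.
csv_text = """RowID,OrderID,Order Date,Customer ID,City,Postal Code,Sales
1,A-100,08/11/2017,C-1,Springfield,11111,100.0
2,A-101,12/06/2018,C-2,Riverton,22222,250.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop the columns that are not needed for the analysis.
df = df.drop(columns=["RowID", "Customer ID", "OrderID", "Postal Code"])

# Parse the date column, then split it into year, month, and day.
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%d/%m/%Y")
df["year"] = df["Order Date"].dt.year
df["month"] = df["Order Date"].dt.month
df["day"] = df["Order Date"].dt.day

print(df)
```

Passing an explicit `format` avoids silent day/month mix-ups, which are easy to miss when the result is only inspected through a summary table.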
Step 6: Performing Exploratory Data Analysis
Step 6.1: First let us see the sales distribution across countries
Input:
pandas_ai(df, "Create a choropleth map using a suitable library (e.g., Plotly, Geopandas) "
              "to visualize the sales distribution across countries.")
Output:
Step 6.2: Top sales based on the city.
Having identified the countries with high sales in our given data, our focus now shifts to gaining a more detailed understanding of the sales distribution across various cities. Analyzing the sales distribution in different cities will provide us with insights into which cities are contributing significantly to the overall sales performance. This analysis will offer a more granular view of the store’s sales success and help identify the cities where the store’s products are most popular or in high demand.
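Underneath, a per-city ranking is a group-and-sum followed by a sort. A minimal Pandas sketch on made-up rows, assuming columns named City and Sales:

```python
import pandas as pd

# Made-up rows; City and Sales are assumed column names.
df = pd.DataFrame({
    "City": ["New York", "Seattle", "New York", "Chicago", "Seattle", "New York"],
    "Sales": [200.0, 150.0, 300.0, 90.0, 50.0, 100.0],
})

# Total sales per city, largest first, limited to the top 10.
sales_by_city = df.groupby("City")["Sales"].sum().sort_values(ascending=False)
top_cities = sales_by_city.head(10)

print(top_cities)
```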
Input:
pandas_ai(df, "I want to see the best sales based on City")
Output:
A visual depiction usually works best for sales distributions, since raw tables of letters and numbers are harder to take in.
Input:
pandas_ai(df, "Display top 10 best sales based on City in the form of a Bar Chart also show the values of each bar")
Output:
Step 6.3: Let’s look at the products that are most in demand.
Now that we know which city has the highest sales, let’s see which product categories are driving this.
Input:
pandas_ai(df, "Create a bar plot to compare the sales across different product categories.")
Output:
Step 6.4: Let’s look at the products that are most in demand in various years.
Input:
pandas_ai(df, "Display side by side bar chart for various years, based on different product categories")
Output:
Step 6.5: Sales analysis per year, month, day.
Input:
pandas_ai(df, "Do a sales analysis per year, display it in the form of a line chart. Remove the year 2019")
Input:
pandas_ai(df, "Do a sales analysis per month, display it in the form of a line chart. Remove the year 2019")
Output:
Input:
pandas_ai(df, "Do a sales analysis per day, display it in the form of a line chart. Remove the year 2019")
Output:
Overall Conclusion:
In conclusion, the exploratory data analysis (EDA) has revealed compelling insights about the sales data. The analysis indicates that New York stores record the highest sales compared to other locations. Additionally, it is evident that the demand for technology products is notably higher among the various product categories. Notably, the year 2018 witnessed a substantial surge in overall demand for products. Moreover, there is a consistent increase in sales starting from the year 2016 and continuing onward. Furthermore, the end of the year, particularly during November and December, witnesses a significant spike in sales, suggesting a peak in consumer activity during the holiday season. These valuable findings from the EDA provide crucial information to guide strategic decision-making and capitalize on key trends in the sales data.
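Findings like the yearly trend and the end-of-year spike can be cross-checked with a few lines of plain Pandas. The sketch below uses made-up rows and assumes columns named Order Date and Sales. It excludes 2019, as the prompts above do.

```python
import pandas as pd

# Made-up rows; Order Date and Sales are assumed column names.
df = pd.DataFrame({
    "Order Date": pd.to_datetime([
        "2016-11-05", "2017-03-10", "2017-12-20",
        "2018-11-15", "2018-12-01", "2019-01-02",
    ]),
    "Sales": [100.0, 50.0, 150.0, 200.0, 250.0, 30.0],
})

# Exclude 2019 before aggregating.
df = df[df["Order Date"].dt.year != 2019]

# Total sales per calendar year and per month of the year.
sales_per_year = df.groupby(df["Order Date"].dt.year)["Sales"].sum()
sales_per_month = df.groupby(df["Order Date"].dt.month)["Sales"].sum()

print(sales_per_year)
print(sales_per_month)
```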
Just like in this example, PandasAI enables the execution of advanced data analysis and manipulation. Now, it’s your turn to explore and utilize its capabilities independently.
Conclusion:
In conclusion, PandasAI stands poised to revolutionize data analysis and artificial intelligence across various fields. By combining the power of Pandas, a popular data manipulation library, with the capabilities of AI, this innovative framework empowers data professionals and researchers to explore, analyze, and draw deeper insights from their data like never before.
In the hands of skilled data professionals and researchers, PandasAI will unlock a new era of insights, propelling fields forward and creating opportunities for innovation and discovery. As we embrace this game-changing technology, the possibilities are limitless, and the future of data-driven excellence looks brighter than ever. Embrace PandasAI today and be at the forefront of the data-driven revolution.