Notes

  • Pandas allows for easy organization and reading of data
  • Datasets must be clean before feeding them to a computer
    • pandas provides tools to help with this cleaning
  • When creating or using a dataset you must consider:
    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?

Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users' capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data. "determining the artist with the greatest attendance during a particular month"
    • Data poses challenge. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned.

  • This code defines 'df' (DataFrame) by reading the data from 'files/grade.json'
df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# The grades need to be checked for valid values, and the Student ID should not be "nil"

# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
# Data should be cleaned as soon as it is input, because bad input produces bad output
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75
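
The issues above can be fixed with pandas itself. Below is a minimal sketch of one approach; since 'files/grade.json' isn't included here, it builds a few rows shaped like the table above, then coerces bad values to NaN and drops invalid or duplicate rows:

```python
import pandas as pd

# a small sample matching the structure of grade.json above
df = pd.DataFrame({
    "Student ID": ["123", "246", "nil", "469", "469"],
    "Year in School": ["12", "10", "9", "Junior", "11"],
    "GPA": [3.57, 4.00, 2.80, 4.75, 3.45],
})

# coerce non-numeric entries (like "nil" or "Junior") to NaN
df["Student ID"] = pd.to_numeric(df["Student ID"], errors="coerce")
df["Year in School"] = pd.to_numeric(df["Year in School"], errors="coerce")

# keep only valid rows: no NaNs, GPA on a 0.0-4.0 scale, no duplicate students
clean = df.dropna()
clean = clean[(clean.GPA >= 0.0) & (clean.GPA <= 4.0)]
clean = clean.drop_duplicates(subset=["Student ID"])
print(clean)
```

The same steps (to_numeric with errors="coerce", dropna, a range filter, drop_duplicates) apply to the full grade.json file.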

Extracting Info

Take a look at some features of the Pandas library that extract info from the dataset.

DataFrame Extract Column

  • This code uses data from the .json file 'grade.json' to output the student IDs and GPAs, with or without an index
print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75

DataFrame Sort

  • The following code shows how to output data in ascending or descending order using the 'sort_values' function.
print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75

DataFrame Selection or Filter

  • This simple code prints only the rows with a GPA greater than 3.00
print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

DataFrame Selection Max and Min

  • Similar to the last code block, this one displays only the rows with the maximum and minimum GPAs.
print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75
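
Beyond max and min, the hacks later in this blog ask for statistics like mean, median, and mode, which pandas provides as built-in methods. As a sketch, using the numeric GPAs from the table above (with the invalid 4.75 removed):

```python
import pandas as pd

# GPAs from the grade table above, dropping the out-of-range 4.75
gpas = pd.Series([3.57, 4.00, 2.78, 3.45, 3.33, 2.95, 3.90, 3.15, 2.80, 3.45, 2.75])

print("mean:  ", round(gpas.mean(), 2))
print("median:", gpas.median())
print("mode:  ", gpas.mode().tolist())  # mode() can return more than one value

# describe() summarizes count, mean, std, min, quartiles, and max in one call
print(gpas.describe())
```

The same methods work on a DataFrame column, e.g. df["GPA"].mean().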

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

  • This code block uses the pandas module to create a DataFrame from scratch; the data is printed below.
import pandas as pd

#the data can be stored as a python dictionary ('data' avoids shadowing the built-in dict)
data = {
  "calories": [420, 380, 390, 100000],
  "duration": [50, 40, 45, 1],
  "before": [150, 139, 176, 200],
  "after": [148, 138, 175, 1]
}

#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(data, index=["day1", "day2", "day3", "day4"])
print(df)
-------------Dict_to_DF------------------
   calories  duration  before  after
0       420        50     150    148
1       380        40     139    138
2       390        45     176    175
3    100000         1     200      1
----------Dict_to_DF_labels--------------
      calories  duration  before  after
day1       420        50     150    148
day2       380        40     139    138
day3       390        45     176    175
day4    100000         1     200      1

Examine DataFrame Rows

  • This code block uses the 'loc' function to select specific rows by their labels.
print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
-------Examine Selected Rows---------
      calories  duration  before  after
day1       420        50     150    148
day3       390        45     176    175
--------Examine Single Row-----------
calories    420
duration     50
before      150
after       148
Name: day1, dtype: int64
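
'loc' selects rows by label; its counterpart 'iloc' selects by integer position. A quick sketch of the difference, using a trimmed three-day version of the dictionary above:

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

print(df.loc["day2"])    # select the row labeled "day2"
print(df.iloc[1])        # the same row, selected by position
print(df.iloc[0:2, 0])   # first two rows of the first column (calories)
```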

Pandas DataFrame Information

  • This code block uses the 'info' function to provide information like the length of the data frame, column data types, and memory usage.
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, day1 to day3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  3 non-null      int64
 1   duration  3 non-null      int64
 2   before    3 non-null      int64
 3   after     3 non-null      int64
dtypes: int64(4)
memory usage: 228.0+ bytes
None

Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

  • This code block uses the pandas module to read the 'data.csv' file and output the data sorted by workout duration.
import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas DataFrames.

  • Data can be fetched and the resulting JSON placed into a DataFrame
  • Observe the output; it looks very similar to a database
  • This code block gets data from a COVID-19 API and displays it below; it looks like a database table because the information comes from a database
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    response = requests.get(url)
    json = response.json()

    # filter data for requirement
    df = pd.DataFrame(json['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
    
fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352
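
Note that the cases and deaths above arrive as comma-formatted strings, so numeric operations on them would fail until they are cleaned. A sketch of how they could be converted, using sample rows copied from the output above:

```python
import pandas as pd

# sample rows shaped like the API output above
df = pd.DataFrame({
    "country_name": ["USA", "India", "Brazil"],
    "cases": ["82,649,779", "43,057,545", "30,345,654"],
    "deaths": ["1,018,316", "522,193", "662,663"],
})

# strip the thousands separators and convert to integers
for col in ["cases", "deaths"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

# numeric operations now work, e.g. sorting by deaths
print(df.sort_values(by=["deaths"], ascending=False))
```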

Hacks

AP Prep

  • Add this Blog to your own Blogging site. In the Blog, add notes and observations on each code cell.
  • In the blog, add College Board practice problems for 2.3.

Over the next 4 weeks, teachers want you to improve your understanding of data. Look at this blog and others on Unit 2. Your intention is to find some things to differentiate your individual College Board project.

  • Create or find your own dataset. The suggestion is to use a JSON file; integrating with your PBL project would be Fambulous. When choosing a data set, think about the following:
    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
    • ...
  • Continue this Blog using Pandas to extract info from that dataset (ex. max, min, mean, median, mode, etc.)

College Board questions

1. Which of the following is an advantage of a lossless compression algorithm over a lossy compression algorithm?
   Answer: B. A lossless compression algorithm can guarantee reconstruction of original data, while a lossy compression algorithm cannot.
2. A user wants to save a data file on an online storage site. The user wants to reduce the size of the file, if possible, and wants to be able to completely restore the file to its original version. Which of the following actions best supports the user's needs?
   Answer: A. Compressing the file using a lossless compression algorithm before uploading it
3. A programmer is developing software for a social media platform. The programmer is planning to use compression when users send attachments to other users. Which of the following is a true statement about the use of compression?
   Answer: C. Lossy compression of an image file generally provides a greater reduction in transmission time than lossless compression does.

Yahoo Finance Dataset

This code uses data from Yahoo Finance and functions from pandas to display the maximum, minimum, median, and mean stock prices for Apple over the last quarter, as well as displaying all closing prices at the end.

import pandas as pd
import yfinance as yf

# Set the ticker symbol and date range
ticker = "AAPL"
start_date = "2022-12-01"
end_date = "2023-02-28"

# Get the stock data from Yahoo Finance
data = yf.download(ticker, start=start_date, end=end_date)

# Calculate the mean and median closing prices for the last quarter
mean_price = data["Close"].mean()
median_price = data["Close"].median()
min_price = data["Close"].min()
max_price = data["Close"].max()


# Print the results
print(f"Mean price for {ticker} in the last quarter: ${mean_price:.2f}")
print(f"Median price for {ticker} in the last quarter: ${median_price:.2f}")
print(f"Minimum price for {ticker} in the last quarter: ${min_price:.2f}")
print(f"Maximum price for {ticker} in the last quarter: ${max_price:.2f}")
print(f"data for {ticker} in the last quarter: ")
print(data["Close"])
[*********************100%***********************]  1 of 1 completed
Mean price for AAPL in the last quarter: $141.22
Median price for AAPL in the last quarter: $142.65
Minimum price for AAPL in the last quarter: $125.02
Maximum price for AAPL in the last quarter: $155.33
data for AAPL in the last quarter: 
Date
2022-12-01    148.309998
2022-12-02    147.809998
2022-12-05    146.630005
2022-12-06    142.910004
2022-12-07    140.940002
2022-12-08    142.649994
2022-12-09    142.160004
2022-12-12    144.490005
2022-12-13    145.470001
2022-12-14    143.210007
2022-12-15    136.500000
2022-12-16    134.509995
2022-12-19    132.369995
2022-12-20    132.300003
2022-12-21    135.449997
2022-12-22    132.229996
2022-12-23    131.860001
2022-12-27    130.029999
2022-12-28    126.040001
2022-12-29    129.610001
2022-12-30    129.929993
2023-01-03    125.070000
2023-01-04    126.360001
2023-01-05    125.019997
2023-01-06    129.619995
2023-01-09    130.149994
2023-01-10    130.729996
2023-01-11    133.490005
2023-01-12    133.410004
2023-01-13    134.759995
2023-01-17    135.940002
2023-01-18    135.210007
2023-01-19    135.270004
2023-01-20    137.869995
2023-01-23    141.110001
2023-01-24    142.529999
2023-01-25    141.860001
2023-01-26    143.960007
2023-01-27    145.929993
2023-01-30    143.000000
2023-01-31    144.289993
2023-02-01    145.429993
2023-02-02    150.820007
2023-02-03    154.500000
2023-02-06    151.729996
2023-02-07    154.649994
2023-02-08    151.919998
2023-02-09    150.869995
2023-02-10    151.009995
2023-02-13    153.850006
2023-02-14    153.199997
2023-02-15    155.330002
2023-02-16    153.710007
2023-02-17    152.550003
2023-02-21    148.479996
2023-02-22    148.910004
2023-02-23    149.399994
2023-02-24    146.710007
2023-02-27    147.919998
Name: Close, dtype: float64