Unit 2.3 Extracting Information from Data, Pandas
Data connections, trends, and correlation. Pandas is introduced because it can be valuable for PBL and data validation, as well as for understanding College Board topics.
- Notes
- Pandas and DataFrames
- Cleaning Data
- Extracting Info
- Create your own DataFrame
- Example of larger data set
- APIs are a Source for Writing Programs with Data
- Hacks
- Yahoo Finance Dataset
Notes
- Pandas allows for easy organization and reading of data
- Datasets must be clean before feeding them to a computer
- this can be improved with pandas
- When creating or using a dataset you must consider:
- Does it have a good sample size?
- Is there bias in the data?
- Does the data set need to be cleaned?
- What is the purpose of the data set?
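These questions can be checked programmatically before any analysis. A minimal sketch, assuming a small hypothetical survey dataset (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical survey data: note the missing value and the implausible age
survey = pd.DataFrame({
    "age": [25, 31, None, 240],
    "score": [88, 92, 75, 90],
})

print(len(survey))               # sample size: is 4 rows really enough?
print(survey.isna().sum())       # missing values that need cleaning
print(survey[survey.age > 120])  # implausible outliers suggest dirty data
```

Even a quick pass like this answers the checklist above: row count for sample size, null counts and outliers for cleaning needs.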
Pandas and DataFrames
In this lesson we will be exploring data analysis using Pandas.
- College Board talks about ideas like
- Tools. "the ability to process data depends on users' capabilities and their tools"
- Combining Data. "combine county data sets"
- Status on Data. "determining the artist with the greatest attendance during a particular month"
- Data poses challenges. "the need to clean data", "incomplete data"
- From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd
df = pd.read_json('files/grade.json')
print(df)
# What part of the data set needs to be cleaned?
# The grades need to be checked, and the Student ID must not be "nil"
# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
# You should validate input before processing it, because bad input creates bad output
print(df[['GPA']])
print()
#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
print(df.sort_values(by=['GPA']))
print()
#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
print(df[df.GPA > 3.00])
print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
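Since files/grade.json may not be available outside the original blog, here is a hedged sketch of the cleaning step the comments above describe, using a small made-up roster with a "nil" Student ID and an out-of-range GPA:

```python
import pandas as pd

# Hypothetical roster illustrating the problems noted in the comments above
roster = pd.DataFrame({
    "Student ID": ["123", "456", "nil", "789"],
    "GPA": [3.5, 4.2, 3.0, 7.0],  # 7.0 is outside a plausible 0.0-4.5 range
})

# Drop rows whose Student ID is the placeholder "nil"
clean = roster[roster["Student ID"] != "nil"]
# Keep only GPAs inside a plausible range
clean = clean[(clean.GPA >= 0.0) & (clean.GPA <= 4.5)]
print(clean.to_string(index=False))
```

Filtering like this before analysis is the "clean early" answer to the Garbage in, Garbage out question.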
import pandas as pd
#the data can be stored as a Python dictionary
data = {
    "calories": [420, 380, 390, 100000],
    "duration": [50, 40, 45, 1],
    "before": [150, 139, 176, 200],
    "after": [148, 138, 175, 1]
}
#store the data in a DataFrame (the name data avoids shadowing the built-in dict)
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)
print("----------Dict_to_DF_labels--------------")
#or with the index argument, you can label rows
df = pd.DataFrame(data, index=["day1", "day2", "day3", "day4"])
print(df)
print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])
#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
print(df.info())
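df.info() reports column types and non-null counts; its companion for summary statistics is df.describe(). A short sketch on the same workout-style data:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390, 100000],
    "duration": [50, 40, 45, 1],
}, index=["day1", "day2", "day3", "day4"])

stats = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(stats)
print(stats.loc["mean", "duration"])  # mean of the duration column
```

Note how the single outlier row (100000 calories, 1 minute) drags the mean around, which is exactly why cleaning matters before summarizing.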
import pandas as pd
#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)
print("--Duration Top 10---------")
print(df.head(10))
print("--Duration Bottom 10------")
print(df.tail(10))
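files/data.csv is not included here, so the sort/head/tail pattern above can be checked with a small stand-in table built in code (the columns below are hypothetical):

```python
import pandas as pd

# Stand-in for files/data.csv: a tiny workout table
df = pd.DataFrame({
    "Duration": [60, 45, 30, 20, 90],
    "Calories": [400, 300, 200, 150, 600],
})

df = df.sort_values(by=["Duration"], ascending=False)
print(df.head(2))  # longest workouts
print(df.tail(2))  # shortest workouts
```

The same chained read-then-sort works with pd.read_csv when the file is present.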
APIs are a Source for Writing Programs with Data
3rd-party APIs are a great source for creating Pandas DataFrames.
- Data can be fetched and the resulting JSON placed into a DataFrame
- Observe the output; it looks very similar to a database table
- This code block gets data from a COVID API and displays it below. It looks like a database because the information comes from one.
'''Pandas can be used to analyze data'''
import pandas as pd
import requests
def fetch():
    '''Obtain data from an endpoint'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    fetch = requests.get(url)
    json = fetch.json()
    # filter data for requirement
    df = pd.DataFrame(json['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths'])  # show rows 0 through 5, columns country_name through deaths

fetch()
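The endpoint above may not stay online forever, so in practice it helps to guard the request. A hedged sketch of the same fetch with basic error handling (the URL and the countries_stat key are taken from the cell above; a real response may differ):

```python
import pandas as pd
import requests

def fetch_covid(url="https://flask.nighthawkcodingsociety.com/api/covid/"):
    '''Fetch the COVID endpoint; return country stats as a DataFrame, or None on failure.'''
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # surface HTTP errors (404, 500, ...)
        body = resp.json()
        return pd.DataFrame(body["countries_stat"])
    except (requests.RequestException, KeyError, ValueError) as err:
        print(f"fetch failed: {err}")
        return None

df = fetch_covid()
if df is not None:
    print(df.loc[0:5, "country_name":"deaths"])
```

Returning None instead of crashing lets the notebook keep running even when the API or network is down.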
Hacks
AP Prep
- Add this blog to your own blogging site. In the blog, add notes and observations on each code cell.
- In the blog, add College Board practice problems for 2.3.
Over the next 4 weeks, teachers want you to improve your understanding of data. Look at this blog and others on Unit 2. Your intention is to find some things to differentiate your individual College Board project.
- Create or find your own dataset. The suggestion is to use a JSON file; integrating with your PBL project would be Fambulous.
When choosing a data set, think about the following:
- Does it have a good sample size?
- Is there bias in the data?
- Does the data set need to be cleaned?
- What is the purpose of the data set?
- ...
- Continue this blog, using Pandas to extract info from that dataset (ex. max, min, mean, median, mode, etc.)
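Those summary statistics can all be pulled from a single column. A minimal sketch with a hypothetical scores column:

```python
import pandas as pd

df = pd.DataFrame({"score": [70, 85, 85, 90, 100]})

print(df.score.max())      # largest value
print(df.score.min())      # smallest value
print(df.score.mean())     # average
print(df.score.median())   # middle value
print(df.score.mode()[0])  # most common value (mode() returns a Series; take the first)
```

The same one-liners apply to any numeric column in the dataset you choose for the hack.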
College board questions
Question | Answer |
---|---|
1. Which of the following is an advantage of a lossless compression algorithm over a lossy compression algorithm? | B. A lossless compression algorithm can guarantee reconstruction of original data, while a lossy compression algorithm cannot. |
2. A user wants to save a data file on an online storage site. The user wants to reduce the size of the file, if possible, and wants to be able to completely restore the file to its original version. Which of the following actions best supports the user’s needs? | A. Compressing the file using a lossless compression algorithm before uploading it |
3. A programmer is developing software for a social media platform. The programmer is planning to use compression when users send attachments to other users. Which of the following is a true statement about the use of compression? | C. Lossy compression of an image file generally provides a greater reduction in transmission time than lossless compression does. |
import pandas as pd
import yfinance as yf
# Set the ticker symbol and date range
ticker = "AAPL"
start_date = "2022-12-01"
end_date = "2023-02-28"
# Get the stock data from Yahoo Finance
data = yf.download(ticker, start=start_date, end=end_date)
# Calculate the mean and median closing prices for the last quarter
mean_price = data["Close"].mean()
median_price = data["Close"].median()
min_price = data["Close"].min()
max_price = data["Close"].max()
# Print the results
print(f"Mean price for {ticker} in the last quarter: ${mean_price:.2f}")
print(f"Median price for {ticker} in the last quarter: ${median_price:.2f}")
print(f"Minimum price for {ticker} in the last quarter: ${min_price:.2f}")
print(f"Maximum price for {ticker} in the last quarter: ${max_price:.2f}")
print(f"Data for {ticker} in the last quarter:")
print(data["Close"])
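yf.download needs network access and a live Yahoo Finance API, so the statistics logic above can be checked offline on a small hypothetical price series (the numbers below are made up, not real AAPL prices):

```python
import pandas as pd

# Hypothetical closing prices standing in for the yfinance download
close = pd.Series([150.0, 152.5, 149.0, 151.0], name="Close")

print(f"Mean:   ${close.mean():.2f}")
print(f"Median: ${close.median():.2f}")
print(f"Min:    ${close.min():.2f}")
print(f"Max:    ${close.max():.2f}")
```

Since data["Close"] from yfinance is the same kind of pandas object, the identical mean/median/min/max calls work on the real download.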