Dataquest Study Notes [3]

Step 2: Intermediate Python And Pandas

Challenge: Summarizing Data   Dataset: GitHub repository


Processing the datasets:
#1. Read the files
import pandas as pd
all_ages=pd.read_csv("all-ages.csv")
recent_grads=pd.read_csv("recent-grads.csv")
print(all_ages[:5])
print(recent_grads[:5])

#2. For both datasets, compute the sum of Total within each Major_category. Two implementations:
# Approach 1: pivot_table
import numpy as np
aa_cat_counts = dict()
rg_cat_counts = dict()
# passing the string "sum" avoids the FutureWarning that recent pandas versions emit for np.sum
get_table_1=all_ages.pivot_table(index="Major_category",values="Total",aggfunc="sum")
index_1=get_table_1.index.tolist()
get_table_2=recent_grads.pivot_table(index="Major_category",values="Total",aggfunc="sum")
index_2=get_table_2.index.tolist()
for i in range(len(get_table_1)):
    # select the Total column first so each value is a number, not a one-element row
    aa_cat_counts[index_1[i]]=get_table_1["Total"].iloc[i]
for i in range(len(get_table_2)):
    rg_cat_counts[index_2[i]]=get_table_2["Total"].iloc[i]
# Approach 2: filter the rows of each category and sum them
aa_cat_counts = dict()
rg_cat_counts = dict()
def calculate_major_cat_totals(df):
    cats = df['Major_category'].unique()
    counts_dictionary = dict()
    for c in cats:
        major_df = df[df["Major_category"] == c]
        total = major_df["Total"].sum()
        counts_dictionary[c] = total
    return counts_dictionary
aa_cat_counts = calculate_major_cat_totals(all_ages)
rg_cat_counts = calculate_major_cat_totals(recent_grads)
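
The two approaches above can probably be written more compactly with groupby. A minimal sketch, assuming a small made-up DataFrame (`toy` and its values are hypothetical, not taken from all-ages.csv):

```python
import pandas as pd

# Hypothetical miniature stand-in for all-ages.csv (values are made up).
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Arts", "Engineering", "Arts", "Health"],
    "Total": [100, 40, 250, 60, 75],
})

# Sum Total within each Major_category, then turn the resulting Series into a dict.
toy_cat_counts = toy.groupby("Major_category")["Total"].sum().to_dict()
print(toy_cat_counts)
```

The result has the same shape as aa_cat_counts: one key per category, one summed Total per key.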

#3. Compute the share of low-wage jobs in the second dataset (recent_grads), as a fraction of Total
low_wage_percent = 0.0
low_wage_percent=recent_grads["Low_wage_jobs"].sum()/recent_grads["Total"].sum()
print(low_wage_percent)
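
The value above is a ratio of sums, so large majors weigh more than small ones. A small sketch with made-up numbers (the `toy` frame is hypothetical) shows how this differs from averaging the per-major ratios:

```python
import pandas as pd

# Hypothetical numbers, not taken from recent-grads.csv.
toy = pd.DataFrame({
    "Total": [1000, 100],
    "Low_wage_jobs": [100, 50],
})

# Overall share: all low-wage jobs divided by all graduates.
overall_share = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Unweighted mean of the per-major shares, which ignores how large each major is.
mean_of_shares = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(overall_share, mean_of_shares)
```

Here the overall share is 150/1100, while the unweighted mean of the two per-major shares is 0.3.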

#4. Compare the two datasets: count the majors where recent grads have a lower unemployment rate than all ages
# All majors, common to both DataFrames
majors = recent_grads['Major'].unique()
rg_lower_count = 0
for major in majors:
    al_rate=all_ages[all_ages["Major"]==major]["Unemployment_rate"].sum()
    rg_rate=recent_grads[recent_grads["Major"]==major]["Unemployment_rate"].sum()
    if al_rate>rg_rate:
        rg_lower_count+=1
print(rg_lower_count)
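
The loop above filters both DataFrames once per major. One possible alternative is to align the two tables with merge and compare whole columns at once. A sketch under the assumption that each Major appears once per table (the toy frames and their values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (values are made up).
toy_all_ages = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.04, 0.07],
})
toy_recent = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.04, 0.06, 0.05],
})

# Align the two tables on Major; the suffixes tell the two rate columns apart.
merged = toy_recent.merge(toy_all_ages, on="Major", suffixes=("_rg", "_aa"))

# Count majors where recent grads have the lower unemployment rate.
is_lower = merged["Unemployment_rate_rg"] < merged["Unemployment_rate_aa"]
toy_lower_count = int(is_lower.sum())
print(toy_lower_count)
```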

Pandas Internals: Series   Dataset: fandango_score_comparison.csv   Source: GitHub repository
It contains the following columns:
FILM - Film name
RottenTomatoes - Average critic score on Rotten Tomatoes
RottenTomatoes_User - Average user score on Rotten Tomatoes
RT_norm - Average critic score on Rotten Tomatoes (normalized to a 0 to 5-point system)
RT_user_norm - Average user score on Rotten Tomatoes (normalized to a 0 to 5-point system)
Metacritic - Average critic score on Metacritic
Metacritic_User - Average user score on Metacritic

>> Some Series operations:
Series.reindex()    Series.sort_index()    Series.sort_values()
# Add the two Series element-wise
np.add(series_custom, series_custom)
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value, not a Series)
np.max(series_custom)
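
A small runnable version of the three calls above, using a made-up Series (`toy_series` and its labels are hypothetical) in place of series_custom:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the labels play the role of film names.
toy_series = pd.Series([10, 80, 45], index=["Film C", "Film A", "Film B"])

doubled = np.add(toy_series, toy_series)   # element-wise, result is still a Series
sines = np.sin(toy_series)                 # element-wise, result is still a Series
highest = np.max(toy_series)               # a single value, not a Series

print(doubled)
print(type(sines).__name__, highest)
```

The element-wise functions keep the original index labels, so the results can still be looked up by film name.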
Series operations:
import pandas as pd
fandango=pd.read_csv("fandango_score_comparison.csv")
print(fandango.head(2))

fandango = pd.read_csv('fandango_score_comparison.csv')
series_film=fandango["FILM"]
print(series_film[:5])
series_rt=fandango["RottenTomatoes"]
print(series_rt[:5])

# Import the Series object from pandas
from pandas import Series
film_names = series_film.values
rt_scores = series_rt.values
series_custom=Series(data=rt_scores,index=film_names)
fiveten = series_custom[5:11]
print(fiveten)

original_index = series_custom.index
sorted_by_index=series_custom.reindex(sorted(original_index))

sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
print(sc2[0:10])
print(sc3[0:10])
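
To see how reindex, sort_index and sort_values differ, here is a short sketch on a made-up Series (`toy_series` is hypothetical, standing in for series_custom):

```python
import pandas as pd

# Hypothetical scores indexed by film name.
toy_series = pd.Series([10, 80, 45], index=["Film C", "Film A", "Film B"])

by_label = toy_series.sort_index()      # ordered by index label
by_value = toy_series.sort_values()     # ordered by the stored values
by_reindex = toy_series.reindex(sorted(toy_series.index))  # same order as by_label

# reindex accepts any list of labels; one that is absent from the index becomes NaN.
with_missing = toy_series.reindex(["Film A", "Film Z"])

print(by_label)
print(by_value)
print(with_missing)
```

With unique labels, reindexing on the sorted index and calling sort_index give the same ordering. reindex is the more general tool because it also selects, reorders, or introduces labels.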

rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean=(rt_critics +rt_users)/2
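
Arithmetic between two Series matches values by index label rather than by position. In the code above both Series share the FILM index, so every film lines up. A sketch with deliberately mismatched, made-up indexes (`critics` and `users` are hypothetical) shows what happens otherwise:

```python
import pandas as pd

# Hypothetical scores; the two Series only partly share their labels.
critics = pd.Series([80, 60, 90], index=["Film A", "Film B", "Film C"])
users = pd.Series([70, 85, 50], index=["Film C", "Film A", "Film D"])

# Values are paired by label; a label present on one side only yields NaN.
toy_mean = (critics + users) / 2
print(toy_mean)
```

Film A and Film C get a mean, while Film B and Film D come out as NaN because each appears in only one of the two Series.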


DataFrame operations:
import pandas as pd
fandango=pd.read_csv("fandango_score_comparison.csv")
print(fandango.head(2))
print(fandango.index)

fandango = pd.read_csv('fandango_score_comparison.csv')
first_last=fandango.iloc[[0,len(fandango)-1]]

fandango = pd.read_csv('fandango_score_comparison.csv')
fandango_films=fandango.set_index(inplace=False,drop=False,keys="FILM")
print(fandango_films.index)
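
A compact sketch of the two selection styles used above, position-based iloc and label-based loc after set_index, on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature version of the fandango table.
toy = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C"],
    "RottenTomatoes": [80, 60, 90],
})

# Position-based: first and last rows, whatever the index labels are.
first_last_toy = toy.iloc[[0, -1]]

# drop=False keeps FILM as a regular column in addition to using it as the index.
toy_films = toy.set_index("FILM", drop=False)

# Label-based: look up rows by film name, in the order requested.
picked = toy_films.loc[["Film C", "Film A"], "RottenTomatoes"]
print(picked)
```

Using -1 as the last position is equivalent to len(toy)-1 in the original code.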

import numpy as np
# returns the data types as a Series
types = fandango_films.dtypes
# filter data types to just floats, index attributes returns just column names
float_columns = types[types.values == 'float64'].index
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))
print(deviations)
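
Two details may be worth checking here: select_dtypes is another way to pick the float columns, and np.std on a NumPy array defaults to the population formula (ddof=0), while the pandas std method defaults to the sample formula (ddof=1). A sketch on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with one text, two float and one integer column.
toy = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C"],
    "RT_norm": [4.0, 3.0, 5.0],
    "RT_user_norm": [3.5, 4.5, 2.5],
    "Votes": [10, 20, 30],
})

# Keep only the float64 columns.
float_toy = toy.select_dtypes(include="float64")

# Population standard deviation (ddof=0), computed on the underlying arrays.
population_std = float_toy.apply(lambda col: np.std(col.to_numpy()))

# Sample standard deviation (ddof=1), the pandas default.
sample_std = float_toy.std()

print(population_std)
print(sample_std)
```

The two results differ by a factor of sqrt(n/(n-1)), which is noticeable on small tables and shrinks as the number of rows grows.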

# 'Metacritic_user_nom' (without the second "r") is how the column is spelled in the dataset
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_means = rt_mt_user.apply(np.mean, axis=1)
print(rt_mt_means[0:5])
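
With axis=1 the function receives one row at a time, so the result has one value per film. A sketch on made-up scores (`toy` is hypothetical) comparing apply with the built-in mean method:

```python
import numpy as np
import pandas as pd

# Hypothetical normalized user scores for two films.
toy = pd.DataFrame({
    "RT_user_norm": [3.5, 4.5],
    "Metacritic_user_nom": [2.5, 4.0],
}, index=["Film A", "Film B"])

# Row-wise mean through apply: the function is called once per row.
via_apply = toy.apply(np.mean, axis=1)

# The same computation using the DataFrame method directly.
via_method = toy.mean(axis=1)

print(via_apply)
```

Both give one mean per row. The method form is shorter and avoids calling a Python function for every row.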


