Maoyan Movies Top 100 (scraping the data with Python, visualizing it with R)

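Basic crawler workflow:

Crawler goals

1. Python: extract each Top 100 film's title, poster image, rank, cast, release date (and location), and score from the web pages, and save them to a CSV text file

2. R: visualize and analyze the scraped results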

Python 3.6 code

import requests
from requests.exceptions import RequestException  # needed by the except clause below

# Fetch a single page; return its HTML text, or None on failure
def get_one_page(url):
    try:
        headers = {
            "User-Agent":"Mozilla/5.0(Windows;U;Windows NT 6.0 x64;en-US;rv:1.9pre)Gecko/2008072421 Minefield/3.0.2pre"
        }
        response = requests.get(url, headers=headers)
        print(response.status_code)
        if response.status_code == 200:
            return response.text
        else:
            return None
    except RequestException:
        return None
    
import re
def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)

    #print(html)
    for item in items:
        yield {  # yield one dict per film
            'index': item[0],
            'image': item[1],  # poster image URL
            'name': item[2],
            'star': item[3].strip()[3:],  # drop the leading "主演：" (starring) label
            'time': item[4].strip()[5:],  # drop the leading "上映时间：" (release date) label
            'score': item[5].strip() + item[6].strip()  # integer part + fractional part
        }
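# Append one record to the CSV file
import csv
def write_to_csv(item):
    # newline='' keeps the csv module from producing blank rows on Windows
    with open('猫眼top100.csv','a',encoding='utf_8_sig',newline='') as f:
        fieldnames=['index','image','name','star','time','score']
        w=csv.DictWriter(f,fieldnames=fieldnames)
        w.writerow(item)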
        
        
def main(offset):
    url='https://maoyan.com/board/4?offset='+str(offset)
    html=get_one_page(url)
    if html is None:  # request failed or returned a non-200 status
        return

    for item in parse_one_page(html):
        print(item)
        write_to_csv(item)

if __name__=="__main__":
    for i in range(10):  # 10 pages of 10 films each
        main(offset=i*10)
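
To see what the regular expression in parse_one_page actually captures, here is a small offline sketch. The HTML fragment is made up (hypothetical film, cast, and URL), and its tag layout is only an assumption about what one <dd> entry looks like, inferred from the class names in the pattern.

```python
import re

# Same pattern as in parse_one_page, split across lines for readability.
PATTERN = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>'
    '.*?data-src="(.*?)"'
    '.*?name.*?a.*?>(.*?)</a>'
    '.*?star.*?>(.*?)</p>'
    '.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>'
    '.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S,
)

# Made-up fragment imitating one list entry (assumed layout, not copied from the site).
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/0" class="image-link">
    <img data-src="https://example.com/poster.jpg" class="board-img" />
  </a>
  <p class="name"><a href="/films/0">示例电影</a></p>
  <p class="star">
      主演：演员甲,演员乙
  </p>
  <p class="releasetime">上映时间：2000-01-01(美国)</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

def parse_sample(html):
    """Yield one dict per <dd> entry, with the same slicing as parse_one_page."""
    for item in PATTERN.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'name': item[2],
            'star': item[3].strip()[3:],   # drop "主演："
            'time': item[4].strip()[5:],   # drop "上映时间："
            'score': item[5].strip() + item[6].strip(),
        }

records = list(parse_sample(SAMPLE_HTML))
print(records)
```

Every `.*?` is lazy and `re.S` lets `.` cross line breaks, so each group grabs the shortest text between its two landmarks. If the site changes its markup, the pattern simply stops matching and the generator yields nothing.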

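One thing worth noting: write_to_csv never writes a header row, so the CSV has no column names. Below is a minimal sketch of writing the header once with csv.DictWriter.writeheader. The helper write_rows, the in-memory buffer, and the sample row are all made up for illustration and are not part of the script above.

```python
import csv
import io

FIELDNAMES = ['index', 'image', 'name', 'star', 'time', 'score']

def write_rows(f, rows):
    """Write the header row once, then one CSV row per dict."""
    w = csv.DictWriter(f, fieldnames=FIELDNAMES)
    w.writeheader()
    w.writerows(rows)

# The buffer stands in for a real file opened once in write mode, e.g.
# open('猫眼top100.csv', 'w', encoding='utf_8_sig', newline='')
buf = io.StringIO(newline='')
write_rows(buf, [{
    'index': '1',
    'image': 'https://example.com/poster.jpg',  # made-up URL
    'name': '示例电影',                          # made-up film
    'star': '演员甲,演员乙',
    'time': '2000-01-01(美国)',
    'score': '9.5',
}])
csv_text = buf.getvalue()
print(csv_text)
```

Collecting all rows first and opening the file once also avoids reopening it for every film.
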
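R 3.5.1

# Visualizing the Maoyan Top 100 data

library(ggplot2)
# Note: the Chinese column names used below (电影名, 评分, 上映地点, 上映时间) do not come from
# the Python script, which writes no header row; they appear to have been added to the CSV by hand
top<-read.csv("C:\\Users\\admin\\Desktop\\猫眼top100.csv") # load the data
View(top)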

# 1. The 10 highest-rated films

top10<-top[order(top$评分,decreasing=TRUE),][1:10,] # sort by score in descending order and keep the first 10 films
ggplot(aes(x=电影名,y=评分),data=top10)+
  geom_bar(stat = "identity",position = "dodge",width=0.5,fill="#008FC4")+
  scale_y_continuous(expand=c(0,0))+ # remove the axis padding so the bars start right at 0; otherwise the gap between labels and bars looks ugly
  coord_flip()+
  geom_text(aes(label=评分),hjust=1.5,color="black")  # add data labels

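# 2. Number of films by country

city<-top$上映地点
city<-as.vector(gsub("([ ])", "", city)) # remove spaces from the strings
n=length(city)
city_new<-rep(0,100)
for(i in 1:n){
  city_new[i]<-gsub('[(.*)]',"",city[i]) # strip the parenthesis characters (the class also matches literal . and *)
}
city_new<-as.data.frame(table(city_new))

# "未知" (unknown) is not an existing factor level, so this assignment turns row 1 into NA (with a warning)
city_new[1,1]<-"未知"
city_new<-na.omit(city_new) # drop the films whose release country is unknown
# count "法国戛纳" (Cannes, France) as 法国 (France)
city_new[3,2]<-city_new[3,2]+city_new[4,2]
city_new<-city_new[-4,]
colnames(city_new)<-c("group","freq") # frequency table of Top 100 films per country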

library(ggplot2)
library(scales) # provides percent() for the percentage labels
ggplot(city_new,aes(x="",y=freq,fill=group))+
  geom_bar(stat="identity",width=1)+
  geom_text(aes(y=freq/2+c(0,cumsum(freq)[-length(freq)]),label=percent(freq/sum(freq)))) +
  coord_polar(theta = "y")+
  scale_fill_brewer(palette =2)+
  theme_minimal()+
  theme(axis.title=element_blank(),
        axis.ticks=element_blank(),
        axis.text = element_blank(),
        legend.title = element_blank())

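Adjusting the data labels on a ggplot2 pie chart is really hard (pointers from any experts would be welcome).

Result of using the pie function directly

Analysis: US films lead by an overwhelming margin; Japan, South Korea, and France also have a fair number of films in the Top 100.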
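
The per-country frequency table can also be built on the Python side before the data ever reaches R. This is only a sketch: it assumes the scraped `time` field looks like date(place), the sample values are made up, and split_release and count_places are hypothetical helpers. The alias that folds 法国戛纳 into 法国 mirrors the manual merge in the R code.

```python
from collections import Counter

# Merge rule taken from the R code: count "法国戛纳" as "法国".
ALIASES = {'法国戛纳': '法国'}

def split_release(time_field):
    """Split '2000-01-01(美国)' into ('2000-01-01', '美国'); place is '' if absent."""
    date, _, rest = time_field.strip().partition('(')
    return date, rest.rstrip(')')

def count_places(time_fields):
    """Count films per release location, skipping entries with no location."""
    counts = Counter()
    for field in time_fields:
        _, place = split_release(field)
        if place:
            counts[ALIASES.get(place, place)] += 1
    return counts

# Made-up sample values in the assumed date(place) shape.
sample = ['2000-01-01(美国)', '1999(法国戛纳)', '2001-05-20(法国)', '1998']
place_counts = count_places(sample)
print(place_counts.most_common())
```

Merging by name instead of by row position means the result does not depend on how the table happens to be sorted.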

# 3. Years in which the top films are concentrated

# release date
date<-as.Date(top$上映时间)
year<-substr(date,1,4) # keep only the 4-digit year
ydf<-as.data.frame(table(na.omit(year)))
library(ggplot2)
ggplot(aes(x=Var1,y=Freq,group=1),data = ydf)+geom_line()+
  geom_point()+labs(x="Year",y="Count",title="Trend")+
  theme(axis.text.x = element_text(angle = 45) )

Analysis: the trend chart above shows that most of the Maoyan Top 100 films were released in the late 20th and early 21st centuries.
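
For comparison, the per-year counts behind such a trend chart can be sketched in Python with the standard library. It assumes every usable `time` value starts with a 4-digit year, and the sample values are made up. Unlike the as.Date approach, it also keeps entries that give only a year.

```python
from collections import Counter

def year_counts(time_fields):
    """Return sorted (year, count) pairs, ignoring values without a leading 4-digit year."""
    years = (field.strip()[:4] for field in time_fields)
    counts = Counter(y for y in years if len(y) == 4 and y.isdigit())
    return sorted(counts.items())

# Made-up sample values; the empty string stands for a missing date.
sample_times = ['2000-01-01(美国)', '2000-06(日本)', '1994', '']
yearly = year_counts(sample_times)
print(yearly)
```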

# 4. Actors with the most top films

Excel's text-to-columns feature was used to extract the names of all the actors in the 100 films
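
The same split could also be done without Excel. Here is a rough Python sketch; it assumes the names in the `star` field are separated by half-width commas, and the sample cast lists are made up. actor_counts is a hypothetical helper.

```python
from collections import Counter

def actor_counts(star_fields):
    """Count how many films each actor appears in."""
    counts = Counter()
    for field in star_fields:
        names = [name.strip() for name in field.split(',')]
        counts.update(name for name in names if name)  # skip empty pieces
    return counts

# Made-up cast lists in the assumed comma-separated shape.
sample_stars = ['演员甲,演员乙', '演员甲,演员丙', '演员乙, 演员甲']
counts = actor_counts(sample_stars)
print(counts.most_common(2))
```

The resulting name and frequency pairs are the same kind of table that wordcloud2 expects.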

actor<-read.table(file="clipboard",sep="\t",header = TRUE) # read the actor names copied from Excel
data<-as.data.frame(table(actor))
# load the word cloud package
library(devtools)
devtools::install_github("lchiffon/wordcloud2")
library(wordcloud2)

figPath=system.file("examples/t.png",package = "wordcloud2")
wordcloud2(data,figPath=figPath,size=0.5,color="skyblue") # in general, the smaller the size, the sharper the outline

wordcloud2(data,figPath=figPath,size=0.3,color="skyblue")

To squeeze in 哥哥 (Leslie Cheung) and 星爷 (Stephen Chow), I had to sacrifice the looks...

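Important link: mask and letterCloud silently fail · Issue #12 · Lchiffon/wordcloud2

This post used a bar chart, a pie chart, a line chart, and a word cloud to display frequencies (honestly, doing all this in Excel is much quicker and easier, she said very quietly)~

This was just practice and it is a bit rough, but it took me a whole day and I took it very seriously. I hope fellow Zhihu users will show some support!

Edited on 2019-04-26 · Copyright belongs to the author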
