Douban Movie Genre Rankings - Drama - Scraper_小吴不吃香菜's blog - 爱代码爱编程
Douban Movie Genre Rankings: Drama Scraper
Tips:
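- Page being scraped: https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90&action=
- Send a GET request to the JSON endpoint https://movie.douban.com/j/chart/top_list, parse the JSON it returns, and save whichever fields you need.
- How to track down the request URL and its parameters is not covered in detail here (it is tedious to write up, and tiring to read, doge).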
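Comparing the page URL in the tips with the request parameters used by the scraper, the `type`, `interval_id` and `action` values appear to carry over unchanged, while `start` and `limit` are added for the request. The sketch below derives the parameter dict from the page URL under that assumption. `build_params` is a hypothetical helper, not part of the original script, and the default `limit` of 20 is an arbitrary choice.

```python
from urllib.parse import parse_qs, urlsplit

# Ranking page from the tips above.
PAGE_URL = (
    "https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85"
    "&type=11&interval_id=100:90&action="
)


def build_params(page_url, start=0, limit=20):
    """Hypothetical helper: copy type/interval_id/action from the page URL.

    Assumes the JSON endpoint accepts the same values as the page URL,
    which is what the scraper's hard-coded parameters suggest.
    """
    # keep_blank_values=True keeps the empty "action=" entry in the result.
    query = parse_qs(urlsplit(page_url).query, keep_blank_values=True)
    return {
        "type": query["type"][0],
        "interval_id": query["interval_id"][0],
        "action": query["action"][0],
        "start": str(start),
        "limit": str(limit),
    }


params = build_params(PAGE_URL, start=0, limit=5)
```

A dict built this way can be passed to `requests.get(..., params=params)` in place of the hand-written one.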
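import csv

import requests


class DouBan:
    def __init__(self):
        # JSON endpoint that the ranking page loads its data from.
        self.url = "https://movie.douban.com/j/chart/top_list"
        self.ua = {"user-agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"}

    def get_info(self, param):
        # Fetch the ranking data and hand the decoded JSON list to the parser.
        resp = requests.get(url=self.url,
                            params=param,
                            headers=self.ua,
                            timeout=10)
        resp.raise_for_status()  # fail early on an HTTP error status
        info = resp.json()
        self.parse_infos(info)

    # Drops the "剧情" (drama) type to simplify later visualization.
    # For plain scraping practice you can skip this function.
    def list_to_str(self, ls):
        return " ".join(i for i in ls if i != "剧情")

    def parse_infos(self, infos):
        ls = [["Title", "Score", "Vote count", "Rank", "Types", "Region",
               "Release date (mainland China)", "Douban page", "ID"]]
        for i in infos:
            # Guard against an empty regions list instead of raising IndexError.
            region = i["regions"][0] if i["regions"] else ""
            # To keep every type, replace self.list_to_str(i["types"])
            # with " ".join(i["types"]), i.e. do not call list_to_str().
            ls.append([i["title"], i["score"], i["vote_count"], i["rank"],
                       self.list_to_str(i["types"]), region,
                       i["release_date"], i["url"], i["id"]])
        self.write(ls)

    def write(self, ls):
        # csv.writer quotes fields that contain commas, so such titles stay in
        # one column; newline="" is what the csv module expects for files.
        with open("豆瓣数据.csv", "w", encoding="utf-8-sig", newline="") as f:
            csv.writer(f).writerows(ls)

    def main(self):
        limit = input("Number of movies to scrape: ").strip()
        param = {
            "type": "11",
            "interval_id": "100:90",
            "action": "",
            "start": "0",
            "limit": limit,
        }
        self.get_info(param)


if __name__ == "__main__":
    spider = DouBan()
    spider.main()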
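
To try the parsing and saving step without sending a request, here is a minimal offline sketch. The `sample` record is made up for illustration and only mirrors some of the field names the scraper reads (`title`, `score`, `vote_count`, `rank`, `types`, `regions`); `record_to_row` is a hypothetical helper, not part of the original script. The output goes to an in-memory buffer instead of a file.

```python
import csv
import io

HEADER = ["title", "score", "vote_count", "rank", "types", "region"]


def record_to_row(record, skip_type="剧情"):
    """Hypothetical helper: flatten one JSON record into a CSV row."""
    # Join the type list into one string, leaving out the genre being scraped.
    types = " ".join(t for t in record["types"] if t != skip_type)
    # Fall back to an empty string when no region is listed.
    regions = record.get("regions") or [""]
    return [record["title"], record["score"], record["vote_count"],
            record["rank"], types, regions[0]]


# Made-up record for illustration only; not real Douban data.
sample = {
    "title": "Example, The Movie",
    "score": "9.0",
    "vote_count": 1234,
    "rank": 1,
    "types": ["剧情", "犯罪"],
    "regions": ["美国"],
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerow(record_to_row(sample))
csv_text = buffer.getvalue()
```

Because the title contains a comma, `csv.writer` wraps it in quotes, so it still reads back as a single column. Joining the fields by hand with `",".join(...)` would split it in two.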