Scraping MOOC courses


The test code is as follows:

import requests
from lxml import etree
url='https://www.icourse163.org/channel/2001.htm'

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36','cookie':'EDUWEBDEVICE=e3482c970bd946788fdf483d4ab9e30c; WM_TID=x0%2BN2esCvetAEBBUUAbUN2KavOhqDfWC; __yadk_uid=VZPexZY3Yjzj0OQQuIMYPpbk5gO1V1gT; hb_MA-A976-948FFA05E931_source=cn.bing.com; WM_NI=Fg4ObgviBfr0W7q5JD%2Bpd0idvjoSJy7i54T06MFqprUVn1EvnsJdw%2FJl6wtrmTq%2F9aimy3o0%2BZKBZ5umol9tA8HaIw3bP0M551Bl4zZiEFsfh2xwHSg7cSF0rXuC6rHlZmo%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6eea6ed47ac8aa5d4d83aa3eb8ab7d14a978a8fb0d86a8dbebbb3c27ea7b489d2d22af0fea7c3b92a86b9fcb0e16db6b78e87f13c95e9bfd6c950e9b0acd1ec70b6b6a1b4c973f5b7fbd5d73cf6938a92f37a978aa2d2c87eaead8685f6258ab4fa98f77aaf8cbfd0bc6babb3b7d4ee3ce98685a3d93fb7b3a18dbc8083e88882cc7df1bea0d1d03ff7ef8e8db7729ab78ba7f26eaceeb898cb59a5e9adb4c53cb288bba4b57a8cb4968eee37e2a3; MOOC_PRIVACY_INFO_APPROVED=true; NTESSTUDYSI=f7c6f5a08dd94c4bbd2ba1aced7fb3ca; utm="eyJjIjoiIiwiY3QiOiIiLCJpIjoiIiwibSI6IiIsInMiOiIiLCJ0IjoiIn0=|aHR0cHM6Ly9jbi5iaW5nLmNvbS8="; Hm_lvt_77dc9a9d49448cf5e629e5bebaa5500b=1678779070,1678839599; Hm_lpvt_77dc9a9d49448cf5e629e5bebaa5500b=1678839602'}

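# Fetch the page and decode the response body as UTF-8
resp=requests.get(url,headers=headers)
resp.encoding='utf-8'
print(resp.status_code)  # 200 means the request succeeded
titles=[]
unis=[]
i=1

# Parse the downloaded HTML so it can be queried with XPath
sel=etree.HTML(resp.text)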

Fetching the page source this way was easy.

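However, after searching for a long time I still could not find the data I wanted to extract.
Only when I looked closely at the source of the course list page did I realize that the course list on MOOC is loaded by JavaScript, and JavaScript needs a browser to run.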
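One way to confirm this kind of problem before reaching for a browser is to check whether a course title you can see on the page appears in the visible text of the downloaded HTML. The sketch below uses only the standard library; the helper names (`VisibleText`, `in_static_markup`) and both sample pages are made up for illustration and are not real icourse163 markup.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def in_static_markup(html, needle):
    """Return True if `needle` appears in the visible text of `html`."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return any(needle in chunk for chunk in parser.chunks)


# Hypothetical sample pages: the first only carries the title inside a script,
# the second is what the DOM might look like after the browser ran that script.
static_page = ('<html><body><div id="app"></div>'
               '<script>var courses = ["Python Basics"];</script></body></html>')
rendered_page = ('<html><body><div id="app"><h3>Python Basics</h3></div>'
                 '</body></html>')

print(in_static_markup(static_page, 'Python Basics'))    # False
print(in_static_markup(rendered_page, 'Python Basics'))  # True
```

If the check returns False for `resp.text` but the title is visible in the browser, the content is most likely filled in by JavaScript after the page loads, and an XPath query against the raw response will come back empty.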
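Selenium is a browser automation tool: it drives a real browser (optionally in headless mode), so the page's JavaScript runs and the rendered page becomes available. Combining it with lxml makes the data collection work much better. It does need to be installed in Python first.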

Install selenium

pip3 install selenium

Install the Chrome browser (if you don't have it already)
https://www.google.cn/intl/zh-CN/chrome/

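Once both are installed, Selenium can be used directly (Selenium 4.6 and later download a matching ChromeDriver automatically; older versions need chromedriver on the PATH).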

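# -*- coding: UTF-8 -*-
from lxml import etree
from selenium import webdriver

url = 'https://www.icourse163.org/channel/2001.htm'

# Launch Chrome and load the page so that its JavaScript runs
driver = webdriver.Chrome()
driver.get(url)

titles = []
unis = []
i = 1

# page_source is the HTML of the page as the browser currently has it
text = str(driver.page_source)
driver.quit()  # close the browser once the HTML has been captured

sel = etree.HTML(text)
# Absolute XPath to the course cards; it depends on the current page layout
li_list = sel.xpath('/html/body/div[4]/div[2]/div/div/div/div[2]/div[2]/div/div/div[2]/div[1]/div')

print(len(li_list))
for li in li_list:
    # Paths below are relative to each course card
    new_title = li.xpath('div/div[3]/div[1]/h3/text()')[0].strip()
    new_uni = li.xpath('div/div[3]/div[1]/p/text()')[0].strip()
    print('Course ' + str(i) + ' title: ' + new_title)
    print('Course ' + str(i) + ' university: ' + new_uni)
    titles.append(new_title)
    unis.append(new_uni)
    i = i + 1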

Now the page source contains the course list information.
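The script above reads `page_source` as soon as `driver.get` returns, which can be too early if the course cards are still being loaded. A minimal sketch of a more defensive variant follows, assuming Chrome 109 or later (for the `--headless=new` switch). The helper names `chrome_arguments` and `fetch_rendered_html`, the window size, and the 10-second timeout are illustrative choices, not part of the original script.

```python
def chrome_arguments(headless=True):
    """Command-line switches passed to Chrome (illustrative choice)."""
    args = ['--window-size=1280,900']
    if headless:
        # Run without opening a visible browser window
        args.insert(0, '--headless=new')
    return args


def fetch_rendered_html(url, xpath, timeout=10):
    """Load `url` in Chrome and return the page source once `xpath` matches."""
    # Imported here so chrome_arguments() can be used without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one matching element exists in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Calling `fetch_rendered_html(url, xpath)` with the course-card XPath from the script above would return HTML that can be passed to `etree.HTML` exactly as before.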


Author: morphotherain
Reprint policy: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: morphotherain.