Chr_小屋

Scraping an Online Informatics Question Bank with Python


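I've recently been learning to write web crawlers in Python. My teacher gave me a website to practice on, so I decided to strike while the iron is hot (lol).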

Preface

  • If this post infringes on anyone's rights, let me know and I will take it down.

Page analysis

Getting started

Only the key code is shown here; see GitHub for the complete code.

import requests
……
for j in range(1, 100, 1):
    s = requests.session()
    url = 'http://lib.nbdp.net/paper/%d.html' % j
    html = s.get(url).content
    # Decode the raw bytes as UTF-8, dropping any bytes that cannot be decoded
    html = str(html, encoding='utf-8', errors='ignore')

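In testing, one of the exam papers contained special characters that caused a decoding error, so I added a line that decodes the response explicitly as UTF-8 and ignores any errors.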
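To illustrate what errors='ignore' does, here is a minimal sketch. The byte string is made up for the example (it is not taken from the site) and contains one byte that is invalid in UTF-8. The helper name decode_strict is hypothetical.

```python
# Hypothetical response body: ASCII text with one stray invalid byte (0xff).
raw = b'Question 1\xff: NOIP'


def decode_strict(data: bytes) -> str:
    """Decode strictly; return a short message instead of raising on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return 'decode failed at byte %d' % exc.start


strict = decode_strict(raw)
# Same call shape as in the post: undecodable bytes are silently dropped.
ignored = str(raw, encoding='utf-8', errors='ignore')
# A standard alternative: keep a visible U+FFFD marker where the bad byte was.
replaced = raw.decode('utf-8', errors='replace')

print(strict)
print(ignored)
print(replaced)
```

Dropping bytes silently is convenient for a one-off scrape, but errors='replace' may be preferable when you want to see afterwards where data was lost.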

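import bs4  # needed for the bs4.element annotation on analyzesoup
from bs4 import BeautifulSoup
……
def download_exam():
……
    soup = BeautifulSoup(html, 'lxml')
    # Collect the matching <div> blocks and parse each one
    exams = soup.find_all(name='div', attrs={'s': 'math3'})
    for x in exams:
        out = analyzesoup(x)
        exam.append(out)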
……
def analyzesoup(soupobj: bs4.element.Tag):
    def notNone(obj):
        return obj is not None
    result = soupobj.find(name='p', attrs={'class': 'pt1'})  # question stem
    if notNone(result):
        tigan = result.get_text().strip()  # tigan: question stem
        result = soupobj.find(name='li')  # present if this is a multiple-choice question
        if notNone(result):
            xuanxiang = []  # xuanxiang: answer options
            for i in soupobj.find_all(name='li'):
                xuanxiang.append(i.get_text())
            result = soupobj.find(attrs={'class': 'col-md-3 column xz'})
            if notNone(result):
                daan = result.get_text()  # daan: answer
            else:
                daan = 'Answer not found'
            out = {'tg': tigan, 'xx': xuanxiang, 'da': daan}
            return out
……
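The multiple-choice branch above can be tried offline on a small HTML fragment. The sketch below makes several assumptions: the sample markup is invented to match the selectors in the post (the real page may differ), the outer class name question and the function name parse_choice_question are hypothetical, and the built-in 'html.parser' is used in place of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the structure the post's selectors expect.
SAMPLE_HTML = """
<div class="question">
  <p class="pt1"> 1. Which of these is a programming language? </p>
  <ul>
    <li>A. Python</li>
    <li>B. HTTP</li>
  </ul>
  <div class="col-md-3 column xz">A</div>
</div>
"""


def parse_choice_question(block):
    """Return {'tg', 'xx', 'da'} for a multiple-choice block, else None."""
    stem = block.find(name='p', attrs={'class': 'pt1'})
    if stem is None or block.find(name='li') is None:
        return None  # no stem, or no <li> options: not a multiple-choice block
    options = [li.get_text() for li in block.find_all(name='li')]
    # A class string with spaces matches the exact class attribute value.
    answer = block.find(attrs={'class': 'col-md-3 column xz'})
    return {
        'tg': stem.get_text().strip(),
        'xx': options,
        'da': answer.get_text() if answer is not None else 'Answer not found',
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
question = parse_choice_question(soup.find('div', attrs={'class': 'question'}))
print(question)
```

Returning None for blocks that do not look like multiple-choice questions lets the caller decide whether to try another parser or skip the block.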
import xlwt
……
def download_exam():
    workbook = xlwt.Workbook(encoding='utf-8')
    for j in range(1, 102, 1):
        # One worksheet per paper, named after the paper number and the page title
        title = str(j) + soup.title.get_text()
        worksheet = workbook.add_sheet(title)
……
        sheetwriter(exam, worksheet)
    workbook.save('dump.xls')
def sheetwriter(list, sheetobj):
    ……
    _row = 1
    for item in list:
……
        if 'xx' in item:  # multiple-choice question: one row per question
            sheetobj.write(_row, 1, label=item['tg'])  # column 1: question stem
            sheetobj.write(_row, 0, label=item['da'])  # column 0: answer
            col = 2
            for i in item['xx']:  # options go in columns 2, 3, ...
                sheetobj.write(_row, col, label=i)
                col += 1
            _row += 1
            continue
        if 'dm' in item:  # fill-in-the-blank question: one row per line of text
            lines = item['tg'].splitlines(False)
            _row += 1  # leave a blank row before the question
            for line in lines:
                sheetobj.write(_row, 1, label=line)
                _row += 1
            lines = item['dm'].splitlines(False)
            for line in lines:
                sheetobj.write(_row, 1, label=line)
                _row += 1
            continue
……
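The row and column bookkeeping in sheetwriter is easier to check without a workbook. The sketch below is a hypothetical helper (layout_cells is not part of the original code) that mirrors only the two branches shown above and returns the cells as (row, column, value) tuples. The sample items are invented for the example.

```python
def layout_cells(items):
    """Return (row, col, value) tuples following the two branches shown above."""
    cells = []
    row = 1  # the post starts writing at _row = 1
    for item in items:
        if 'xx' in item:  # multiple-choice: everything on one row
            cells.append((row, 0, item['da']))
            cells.append((row, 1, item['tg']))
            for offset, option in enumerate(item['xx']):
                cells.append((row, 2 + offset, option))
            row += 1
        elif 'dm' in item:  # fill-in-the-blank: one row per line, column 1 only
            row += 1  # blank row before the question
            for line in item['tg'].splitlines() + item['dm'].splitlines():
                cells.append((row, 1, line))
                row += 1
    return cells


# Invented sample data using the same dictionary keys as the post.
sample_items = [
    {'tg': 'Q1?', 'xx': ['A. x', 'B. y'], 'da': 'A'},
    {'tg': 'Fill in the blanks:', 'dm': 'a = __\nprint(a)'},
]
cells = layout_cells(sample_items)
for cell in cells:
    print(cell)
```

Each tuple could then be passed on as sheetobj.write(row, col, label=value), which keeps the layout logic testable separately from the Excel output.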

Afterword
