Scraping the 信息学奥赛一本通 Problem Bank with Python

2019-05-21T22:43:00


Preface

One day my teacher sent me a mysterious link. Time to get to work.

Page Analysis

• The home page is a full-on throwback to the year 2000.
• The markup of the chapter index at the top (screenshot).
• I opened two different chapters and noticed the URL never changed:

  • Capturing the traffic showed that each index link points to a blank page that bounces back to the home page, so my guess was that the server returns different pages depending on the Referer header. (A minimal sketch of the trick follows this list.)

• The markup of the problem index further down the page (screenshot).
• A problem page (screenshot).
• The problem-page markup, whose structure is fairly simple (screenshot).
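Here is a minimal sketch of that Referer trick with requests. The chapter path is a placeholder; the real crawl below does the same dance inside analyze_zhangindex:

    import requests

    S = requests.Session()
    BASE_URL = 'http://ybt.ssoier.cn:8088/'
    chapter_href = 'some_chapter_link'  # hypothetical href taken from the menu

    # Visit the (blank) chapter link first, as a browser would ...
    S.get(BASE_URL + chapter_href)
    # ... then request the site root again with that link as Referer;
    # the server keys on this header to decide which chapter index to return.
    html = S.get(BASE_URL, headers={'Referer': BASE_URL + chapter_href}).content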

Getting to Work

Only the key code is shown here; for the full code, head over to GitHub.

• The environment I used:
  • Python 3.6
  • beautifulsoup4==4.7.1
  • bs4==0.0.1
  • lxml==4.3.3
  • requests==2.21.0
  • xlwt==1.3.0
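The snippets also rely on some module-level setup and helpers (S, the URL constants, fstr, write_to_sheet) that live in the full code on GitHub. A minimal sketch of what that setup presumably looks like; the exact URL values and the behavior of fstr are assumptions:

    import requests
    import xlwt
    from bs4 import BeautifulSoup

    BASE_URL = 'http://ybt.ssoier.cn:8088/'        # site root
    INDEX_URL = BASE_URL                           # assumed: the outline is on the home page
    TEST_URL = BASE_URL + 'problem_show.php?pid='  # assumed problem-page URL prefix

    S = requests.Session()  # one shared session so cookies persist across requests

    def fstr(s, sep=' '):
        # Assumed helper: collapse whitespace in scraped text,
        # joining the pieces with `sep` (the real fstr is on GitHub)
        return sep.join(str(s).split())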
• Parsing the outline of parts and chapters:
    def analyze_index():
        print('Starting...')
        workbook = xlwt.Workbook(encoding='utf-8')
        html = S.get(INDEX_URL).content
        html = str(html, encoding='utf-8', errors='ignore')
        soup = BeautifulSoup(html, 'lxml')
        menu = soup.find(name='div', attrs={'class': 'menuDiv'})
        for child in menu.find_all(attrs={'href': '#'}):
            # Each href="#" anchor heads one part (篇) of the outline
            pianming = fstr(child.h3.string.strip())  # part title
            print('Crawling %s' % pianming)
            zhanglist = []
            for li in child.parent.ul.find_all(name='li'):
                # Each <li> is a chapter (章); follow its link for the problem list
                timulist = analyze_zhangindex(li.a.attrs['href'])
                zhangming = li.string  # chapter title
                print(' Crawling %s' % zhangming)
                zhanglist.append((zhangming, timulist))
            sheetobj = workbook.add_sheet(pianming)  # one worksheet per part
            print('Saving data...')
            write_to_sheet(pianming, zhanglist, sheetobj)
            workbook.save('dump.xls')
            print('Write complete')
        print('Data saved to ./dump.xls')
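Note that the workbook is saved after every part rather than once at the end, so if the crawl dies halfway, everything scraped so far is already in dump.xls.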
• Parsing a chapter's problem index; the two table layouts are handled separately:

    def analyze_zhangindex(url):
        # Visit the chapter link first, then re-request the site root with
        # that link as Referer; the server picks the page from this header
        S.get(BASE_URL + url)
        header = {
            'Referer': 'http://ybt.ssoier.cn:8088/' + url
        }
        html = S.get(BASE_URL, headers=header).content
        html = str(html, encoding='utf-8', errors='ignore')
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find(name='table', attrs={'class': 'plist'})
        lieshu = len(table.find_all(name='th'))  # column count tells the layouts apart
        testlist = []
        if lieshu == 4:
            # Single-column layout: one problem per row
            for td in table.find_all(name='td', attrs={'class': 'xlist'}):
                pid = td.previous_sibling.string  # problem id sits in the previous cell
                tm = td.string                    # problem title
                xq = analyze_testpage(pid)        # problem details
                print('   Crawling [%s] %s' % (str(pid), tm))
                testlist.append((pid, tm, xq))
        elif lieshu == 8:
            # Two-column layout: collect title cells and sub-section header cells
            def specifictd(tag):
                if tag.name == 'td':
                    attrs = tag.attrs
                    if 'class' in attrs and attrs['class'][0] == 'xlist':
                        return True   # a problem-title cell
                    elif tag.find(name='font', attrs={'color': '#001290'}):
                        return True   # a sub-section header cell
                return False
            tds = table.find_all(specifictd)
            # Cells come in document order, alternating left/right column;
            # de-interleave so the whole left column precedes the right one
            a = [tds[i] for i in range(0, len(tds), 2)]
            b = [tds[i] for i in range(1, len(tds), 2)]
            tds = a + b
            for td in tds:
                pid = td.previous_sibling.string
                tm = td.string
                if pid:
                    xq = analyze_testpage(pid)
                    print('   Crawling [%s] %s' % (str(pid), tm))
                    testlist.append((pid, tm, xq))
                else:
                    # No id: this cell is a sub-section header, not a problem
                    print('  Crawling %s' % tm)
                    testlist.append((tm,))
        return testlist
• Parsing a problem page, returning a dict with the problem's details:
    def analyze_testpage(pid):
        html = S.get(TEST_URL + str(pid)).content
        html = str(html, encoding='utf-8', errors='ignore')
        soup = BeautifulSoup(html, 'lxml')
        try:
            td = soup.find(name='td', attrs={'class': 'pcontent'})
            font = td.find(name='font', attrs={'size': '2'}, recursive=False)
            tp = [BASE_URL + img.attrs['src'] for img in td.find_all(name='img')]
        except AttributeError:
            print('***Could not read problem, id: %s' % str(pid))
            return {'error': 'Problem is malformed, or requires special permissions!'}
        tm = []         # description
        input = []      # input spec
        output = []     # output spec
        inputexp = []   # sample input
        outputexp = []  # sample output
        tip = []        # hints
        source = None
        # Walk the tags in order; every <h3> heading bumps `flag`, which tells
        # us which section the following text belongs to
        flag = 0
        for tag in font.find_all(True):
            if tag.name in ('font', 'div', 'br'):
                continue
            elif tag.name == 'h3':
                if tag.string != '【来源】':  # the "[Source]" heading on the page
                    flag += 1
                else:
                    flag = -1
                continue
            if flag == 1:
                tm.append(fstr(tag.string))
            elif flag == 2:
                input.append(fstr(tag.string))
            elif flag == 3:
                output.append(fstr(tag.string))
            elif flag == 4:
                try:
                    # Join with an unlikely token, then split to keep the
                    # sample's pieces separate
                    inputexp.extend(fstr(tag.string, '^$^$^$').split('^$^$^$'))
                except Exception:
                    print('***Error while reading problem, id %s' % str(pid))
            elif flag == 5:
                try:
                    outputexp.extend(fstr(tag.string, '^$^$^$').split('^$^$^$'))
                except Exception:
                    print('***Error while reading problem, id %s' % str(pid))
            elif flag == 6:
                tip.append(fstr(tag.string))
            elif flag == -1:
                try:
                    # 'NO' / '无' ("none") mark problems without a source link
                    if tag.string.upper() != 'NO' and tag.string != '无':
                        source = (BASE_URL + tag.attrs['href'], tag.string)
                        print('!!!!!!!!!!%s,%s' % (source[0], source[1]))  # debug
                finally:
                    break  # the break in finally also swallows any exception above
        # Fill empty sections with '无' ("none") so the spreadsheet has no holes
        if not tm:
            tm = '无'
        if not input:
            input = ['无']
        if not output:
            output = ['无']
        if not inputexp:
            inputexp = ['无']
        if not outputexp:
            outputexp = ['无']
        test = {'tm': tm, 'i': input, 'o': output, 'ie': inputexp, 'oe': outputexp}
        if tp:
            test['tp'] = tp      # image URLs
        if tip:
            test['t'] = tip      # hints
        if source:
            test['s'] = source   # (link, name) of the problem's source
        return test
• Finally, just write everything into an .xls file with xlwt (a hypothetical sketch of the writer follows).
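write_to_sheet itself is in the full code on GitHub; here is a hypothetical sketch of the kind of layout it might produce (the column assignments are my guess, not the original's):

    def write_to_sheet(pianming, zhanglist, sheetobj):
        # Hypothetical sketch; the real writer is in the full code on GitHub
        row = 0
        for zhangming, timulist in zhanglist:
            sheetobj.write(row, 0, zhangming)  # chapter title in column 0
            row += 1
            for timu in timulist:
                if len(timu) == 1:             # (title,) is a sub-section header
                    sheetobj.write(row, 1, timu[0])
                else:                          # (pid, title, details) is a problem
                    pid, tm, xq = timu
                    sheetobj.write(row, 1, str(pid))
                    sheetobj.write(row, 2, tm)
                    sheetobj.write(row, 3, repr(xq))
                row += 1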
• Give it a run (a minimal entry point follows the screenshots):

  • It printed the whole outline structure (screenshot).
  • The downloaded problems (screenshot).
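The entry point is presumably just a call to analyze_index():

    if __name__ == '__main__':
        analyze_index()  # crawls everything and writes ./dump.xls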

Afterword

• This site is honestly quite decent: no ads, plus online code submission. A solid place to learn.
• Serving different pages based on the Referer header is something I had never seen before. Interesting trick.
• Lately I have been writing a script to auto-complete tasks on 小黑盒 (Heybox); no idea when it will be done (it keeps slipping), though auto check-in and auto-like already work.