One day my teacher sent me a mysterious link. Time to get to work.
Preface
- This project is open source: GitHub
- The problem bank is copyrighted by http://ybt.ssoier.cn:8088; this article is for learning and exchange only.
- If this infringes on anyone's rights, it will be taken down on request.
Page Analysis
- The homepage is straight out of the year 2000.
- The source of the chapter index at the top:
- I clicked into 2 different chapters at random and noticed the URL never changed:
- After capturing the traffic, I found that the index links point to a blank page that just bounces back to the homepage, so my guess is that the server returns different pages depending on the Referer (see the sketch after this list).
- The source of the problem index at the bottom:
- The problem page:
- The problem page's source; the structure is fairly simple:
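To double-check the Referer hypothesis, here is a minimal sketch using requests, mirroring what the crawler below does; `chapter_href` is a stand-in for a real chapter link scraped from the homepage, not an actual path on the site:

```python
import requests

BASE_URL = 'http://ybt.ssoier.cn:8088/'
s = requests.Session()

# Stand-in for an actual chapter href taken from the homepage index.
chapter_href = 'xxx.html'

# Requesting the chapter link directly just yields the blank redirect page...
blank = s.get(BASE_URL + chapter_href)

# ...but requesting the site root with that link as the Referer should
# return the chapter's problem list instead.
listing = s.get(BASE_URL, headers={'Referer': BASE_URL + chapter_href})
print(len(blank.content), len(listing.content))
```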
Getting to Work
Only the key code is shown below; for the full code, head over to GitHub.
- My environment:
- Python 3.6
- beautifulsoup4==4.7.1
- bs4==0.0.1
- lxml==4.3.3
- requests==2.21.0
- xlwt==1.3.0
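The snippets below reference a shared session `S`, a few URL constants, and a helper `fstr`, all part of the omitted boilerplate. Here is a minimal sketch of what they might look like; the `TEST_URL` pattern and the `fstr` body are my assumptions, not taken from the post:

```python
import requests
import xlwt
from bs4 import BeautifulSoup

BASE_URL = 'http://ybt.ssoier.cn:8088/'
INDEX_URL = BASE_URL                           # the outline lives on the homepage
TEST_URL = BASE_URL + 'problem_show.php?pid='  # assumed problem-page URL pattern

S = requests.Session()  # shared session so cookies persist across requests

def fstr(s, sep=' '):
    # Assumed helper: strip each line of a tag string and join the
    # non-empty lines with `sep` (raises AttributeError on None,
    # which the callers catch).
    return sep.join(line.strip() for line in s.splitlines() if line.strip())
```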
- Parsing the outline (the chapter directory):
```python
def analyze_index():
    print('开始咯')  # "here we go"
    workbook = xlwt.Workbook(encoding='utf-8')
    html = S.get(INDEX_URL).content
    html = str(html, encoding='utf-8', errors='ignore')
    soup = BeautifulSoup(html, 'lxml')
    menu = soup.find(name='div', attrs={'class': 'menuDiv'})
    # Each href="#" anchor is a section header; its sibling <ul>
    # holds the chapter links.
    for child in menu.find_all(attrs={'href': '#'}):
        pianming = fstr(child.h3.string.strip())  # section name
        print('爬取 %s' % pianming)  # "crawling <section>"
        zhanglist = []
        for li in child.parent.ul.find_all(name='li'):
            timulist = analyze_zhangindex(li.a.attrs['href'])
            zhangming = li.string  # chapter name
            print('  爬取 %s' % zhangming)
            zhanglist.append((zhangming, timulist))
        # One worksheet per section.
        sheetobj = workbook.add_sheet(pianming)
        print('保存数据……')  # "saving data..."
        write_to_sheet(pianming, zhanglist, sheetobj)
    workbook.save('dump.xls')
    print('写入完成')  # "write finished"
    print('数据保存在 ./dump.xls 中')  # "data saved to ./dump.xls"
```
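`write_to_sheet` is part of the code not shown here; a hypothetical reconstruction of its layout (the real version is on GitHub):

```python
def write_to_sheet(pianming, zhanglist, sheetobj):
    # Hypothetical reconstruction: write one chapter header row,
    # then one row per problem tuple from analyze_zhangindex.
    row = 0
    for zhangming, timulist in zhanglist:
        sheetobj.write(row, 0, zhangming)  # chapter name in column A
        row += 1
        for timu in timulist:
            for col, value in enumerate(timu):
                sheetobj.write(row, col + 1, str(value))
            row += 1
```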
- Parsing the problem indexes; the two layouts are handled separately:
```python
def analyze_zhangindex(url):
    # Hit the chapter link first (the blank redirect page), then request
    # the site root with that link as the Referer to get the real list.
    S.get(BASE_URL + url)
    header = {'Referer': 'http://ybt.ssoier.cn:8088/' + url}
    html = S.get(BASE_URL, headers=header).content
    html = str(html, encoding='utf-8', errors='ignore')
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find(name='table', attrs={'class': 'plist'})
    lieshu = len(table.find_all(name='th'))  # column count
    testlist = []
    if lieshu == 4:
        # Single-list layout: every 'xlist' cell is a problem title,
        # and its previous sibling holds the problem id.
        for td in table.find_all(name='td', attrs={'class': 'xlist'}):
            pid = td.previous_sibling.string
            tm = td.string
            xq = analyze_testpage(pid)
            print('    爬取 [%s] %s' % (str(pid), tm))
            testlist.append((pid, tm, xq))
    elif lieshu == 8:
        # Two problem lists side by side; also match the subtitle cells
        # (font color #001290) that carry no problem id.
        def specifictd(tag):
            if tag.name == 'td':
                attrs = tag.attrs
                if ('class' in attrs) and (attrs['class'][0] == 'xlist'):
                    return True
                elif tag.find(name='font', attrs={'color': '#001290'}):
                    return True
            return False
        tds = table.find_all(specifictd)
        # Cells arrive interleaved row by row; split them back into the
        # left and right columns and read the left one first.
        a = [tds[i] for i in range(0, len(tds), 2)]
        b = [tds[i] for i in range(1, len(tds), 2)]
        tds = a + b
        for td in tds:
            pid = td.previous_sibling.string
            tm = td.string
            if pid:
                xq = analyze_testpage(pid)
                print('    爬取 [%s] %s' % (str(pid), tm))
                testlist.append((pid, tm, xq))
            else:
                # A subtitle row, not a problem.
                print('    爬取 %s' % tm)
                testlist.append((tm,))
    return testlist
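The even/odd split handles the 8-column layout, where two problem columns sit side by side and the cells arrive interleaved row by row; a tiny illustration of the reordering:

```python
# Cells arrive row by row: left column, right column, left, right, ...
tds = ['L1', 'R1', 'L2', 'R2', 'L3', 'R3']
a = [tds[i] for i in range(0, len(tds), 2)]  # ['L1', 'L2', 'L3']
b = [tds[i] for i in range(1, len(tds), 2)]  # ['R1', 'R2', 'R3']
print(a + b)  # ['L1', 'L2', 'L3', 'R1', 'R2', 'R3']
```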
- Parsing a problem page, returning a dict with the problem's information:
```python
def analyze_testpage(pid):
    html = S.get(TEST_URL + str(pid)).content
    html = str(html, encoding='utf-8', errors='ignore')
    soup = BeautifulSoup(html, 'lxml')
    try:
        td = soup.find(name='td', attrs={'class': 'pcontent'})
        font = td.find(name='font', attrs={'size': '2'}, recursive=False)
        # Collect any illustration images up front.
        tp = []
        for img in td.find_all(name='img'):
            tp.append(BASE_URL + img.attrs['src'])
    except AttributeError:
        print('***无法读取题目,题号:%s' % str(pid))  # could not read problem
        return {'error': '题目不正常,或者是权限类题目!'}
    tm = []         # statement
    input = []      # input format
    output = []     # output format
    inputexp = []   # sample input
    outputexp = []  # sample output
    tip = []        # hints
    source = None
    flag = 0
    # Walk the tags in document order; every <h3> section header bumps
    # the flag, which decides where the following text is stored.
    # The 【来源】 (source) section is handled last via flag = -1.
    for tag in font.find_all(True):
        if tag.name == 'font' or tag.name == 'div' or tag.name == 'br':
            continue
        elif tag.name == 'h3':
            if tag.string != '【来源】':
                flag += 1
            else:
                flag = -1
            continue
        if flag == 1:
            tm.append(fstr(tag.string))
        elif flag == 2:
            input.append(fstr(tag.string))
        elif flag == 3:
            output.append(fstr(tag.string))
        elif flag == 4:
            try:
                inputexp.extend(fstr(tag.string, '^$^$^$').split('^$^$^$'))
            except:
                print('***读取题目信息遇到错误,题号 %s' % str(pid))
        elif flag == 5:
            try:
                outputexp.extend(fstr(tag.string, '^$^$^$').split('^$^$^$'))
            except:
                print('***读取题目信息遇到错误,题号 %s' % str(pid))
        elif flag == 6:
            tip.append(fstr(tag.string))
        elif flag == -1:
            # Source section: keep the link unless it is 'NO' / '无'.
            try:
                if (tag.string).upper() != 'NO' and tag.string != '无':
                    source = (BASE_URL + tag.attrs['href'], tag.string)
                    print('!!!!!!!!!!%s,%s' % (source[0], source[1]))
            finally:
                break
    # Fill empty sections with a placeholder ('无' = none).
    if not tm:
        tm = '无'
    if not input:
        input = ['无']
    if not output:
        output = ['无']
    if not inputexp:
        inputexp = ['无']
    if not outputexp:
        outputexp = ['无']
    test = {'tm': tm, 'i': input, 'o': output, 'ie': inputexp, 'oe': outputexp}
    if tp:
        test['tp'] = tp
    if tip:
        test['t'] = tip
    if source:
        test['s'] = source
    return test
```
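For reference, the shape of the dict this returns; the keys come from the code above, and the annotations are my reading of what each field holds:

```python
# {
#     'tm': [...],        # problem statement
#     'i':  [...],        # input format
#     'o':  [...],        # output format
#     'ie': [...],        # sample input lines
#     'oe': [...],        # sample output lines
#     'tp': [...],        # image URLs, only if the page has images
#     't':  [...],        # hints, only if present
#     's':  (url, text),  # source link, only if present
# }
```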
- Finally, just write everything into an xls file with xlwt.
- Let's test it:
- It printed the whole directory structure:
- The downloaded problems:
Afterword
- The site is actually quite decent: no ads, and it has online code submission, so it's a nice learning platform.
- Returning different pages based on the Referer is a trick I've never seen before; pretty interesting.
- Lately I've been writing a script to automate the daily tasks on Xiaoheihe (小黑盒); no idea when it'll be done (it keeps slipping), but auto check-in and auto-like already work.
Link to this post: https://blog.chrxw.com/archives/2019/05/21/241.html
Please keep this link when reposting. Thanks!