Scraping the 信息学奥赛一本通 Problem Set with Python

Chr

May 21, 2019

Programming

One day my teacher sent me a mysterious link, so I got to work.

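Preface

This project is open source: GitHub
The problem set is the property of http://ybt.ssoier.cn:8088; this post is for learning and discussion only.
If it infringes on any rights, it will be taken down on request.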

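Page analysis

The home page looks like a throwback to the year 2000.
Source of the chapter index at the top of the page:
I opened two different chapters and found that the URL did not change:
Packet capture showed that each index link points to a blank page that jumps back to the home page, so my guess is that the server returns a different page depending on the Referer header.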
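A minimal sketch of what that guess implies, using only the standard library. It builds the two requests (the chapter link, then the home page with the chapter URL as Referer) without sending anything. The helper name `build_chapter_requests` and the example path are hypothetical, and the trailing slash on `BASE_URL` is an assumption.

```python
from urllib.request import Request

BASE_URL = 'http://ybt.ssoier.cn:8088/'  # assumed to end with a slash

def build_chapter_requests(chapter_path):
    """Return the two requests a chapter click appears to trigger.

    Assumption: the server picks the chapter from the Referer header.
    """
    chapter_url = BASE_URL + chapter_path
    visit = Request(chapter_url)  # the blank page the index link points to
    listing = Request(BASE_URL, headers={'Referer': chapter_url})
    return visit, listing

# 'example_chapter.php' is a made-up path, not a real chapter link
visit, listing = build_chapter_requests('example_chapter.php')
print(listing.full_url, listing.get_header('Referer'))
```

If the guess is right, sending `listing` after `visit` in the same session should return the chapter's problem list rather than the default home page.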
Source of the problem index further down the page:
A problem page:
Source of a problem page; the structure is fairly simple:

Implementation

Only the key code is shown here; see GitHub for the complete code.

My environment:
Python 3.6
beautifulsoup4==4.7.1
bs4==0.0.1
lxml==4.3.3
requests==2.21.0
xlwt==1.3.0

Parsing the outline:

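import xlwt
from bs4 import BeautifulSoup

# S (presumably a requests.Session), INDEX_URL, fstr and write_to_sheet
# are defined elsewhere in the full code on GitHub.
def analyze_index():
    print('Starting')
    workbook = xlwt.Workbook(encoding='utf-8')
    html = S.get(INDEX_URL).content
    html = str(html, encoding='utf-8', errors='ignore')
    soup = BeautifulSoup(html, 'lxml')
    menu = soup.find(name='div', attrs={'class': 'menuDiv'})
    # Each part is an element with href="#" that contains an <h3>;
    # its chapter list is the <ul> under the same parent.
    for child in menu.find_all(attrs={'href': '#'}):
        pianming = fstr(child.h3.string.strip())  # part name
        print('Scraping %s' % pianming)
        zhanglist = []
        for li in child.parent.ul.find_all(name='li'):
            timulist = analyze_zhangindex(li.a.attrs['href'])
            zhangming = li.string  # chapter name
            print(' Scraping %s' % zhangming)
            zhanglist.append((zhangming, timulist))
        sheetobj = workbook.add_sheet(pianming)  # one sheet per part
        print('Saving data...')
        write_to_sheet(pianming, zhanglist, sheetobj)
        # Saving inside the loop keeps finished parts on disk
        workbook.save('dump.xls')
        print('Write complete')
    print('Data saved to ./dump.xls')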
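
`write_to_sheet` is not shown in this post. As a rough idea of what such a helper has to deal with, the sketch below flattens the `zhanglist` structure built above into plain rows. The function name `flatten_rows`, the row layout, and the sample data are assumptions for illustration, not the project's actual code.

```python
def flatten_rows(zhanglist):
    """Flatten [(chapter, [(pid, title, detail) or (heading,), ...]), ...] into rows."""
    rows = []
    for zhangming, timulist in zhanglist:
        for item in timulist:
            if len(item) == 1:
                # 1-tuples are section headings inside a chapter
                rows.append((zhangming, '', item[0]))
            else:
                pid, tm, _xq = item
                rows.append((zhangming, pid, tm))
    return rows

# Made-up sample in the same shape as the scraper's output
sample = [('Chapter 1', [('Section A',), ('1000', 'A+B', {'tm': ['sample text']})])]
for row in flatten_rows(sample):
    print(row)
```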

Parsing the problem index; the two layouts are handled separately:
Index layout 1
Index layout 2

def analyze_zhangindex(url):
    # Visit the chapter link first, then request the home page with the
    # chapter URL as Referer to get that chapter's problem list.
    S.get(BASE_URL + url)
    header = {
        'Referer': 'http://ybt.ssoier.cn:8088/' + url
    }
    html = S.get(BASE_URL, headers=header).content
    html = str(html, encoding='utf-8', errors='ignore')
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find(name='table', attrs={'class': 'plist'})
    lieshu = len(table.find_all(name='th'))  # number of header columns
    testlist = []
    if lieshu == 4:
        # Layout 1: the ID cell sits right before each title cell
        for td in table.find_all(name='td', attrs={'class': 'xlist'}):
            pid = td.previous_sibling.string
            tm = td.string
            xq = analyze_testpage(pid)
            print('   Scraping [%s] %s' % (str(pid), tm))
            testlist.append((pid, tm, xq))
    elif lieshu == 8:
        # Layout 2: twice as many columns, with heading cells mixed in
        def specifictd(tag):
            # Match title cells (class "xlist") and cells holding a
            # <font color="#001290"> element
            if tag.name == 'td':
                attrs = tag.attrs
                if 'class' in attrs and attrs['class'][0] == 'xlist':
                    return True
                elif tag.find(name='font', attrs={'color': '#001290'}):
                    return True
            return False
        tds = table.find_all(specifictd)
        # Reorder: even-indexed cells first, then odd-indexed cells
        tds = tds[0::2] + tds[1::2]
        for td in tds:
            pid = td.previous_sibling.string
            tm = td.string
            if pid:
                xq = analyze_testpage(pid)
                print('   Scraping [%s] %s' % (str(pid), tm))
                testlist.append((pid, tm, xq))
            else:
                # No problem ID: store the cell text on its own as a 1-tuple
                print('  Scraping %s' % tm)
                testlist.append((tm,))
    return testlist
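
In the 8-column layout, the matched cells are reordered so that all even-indexed cells come before the odd-indexed ones. This presumably reflects a table that shows two problems side by side in each row. The reordering step can be illustrated on plain strings; the helper name `column_order` and the sample values are made up.

```python
def column_order(cells):
    """Return even-indexed cells first, then odd-indexed cells."""
    return cells[0::2] + cells[1::2]

# Cells as they would appear in document order: left, right, left, right, ...
cells = ['L1', 'R1', 'L2', 'R2', 'L3']
print(column_order(cells))  # ['L1', 'L2', 'L3', 'R1', 'R2']
```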

Parsing a problem page; returns a dict with the problem's details:

def analyze_testpage(pid):
    html = S.get(TEST_URL + str(pid)).content
    html = str(html, encoding='utf-8', errors='ignore')
    soup = BeautifulSoup(html, 'lxml')
    td = soup.find(name='td', attrs={'class': 'pcontent'})
    font = None
    if td is not None:
        font = td.find(name='font', attrs={'size': '2'}, recursive=False)
    if font is None:
        print('***Cannot read problem, ID: %s' % str(pid))
        # Message: "the problem is abnormal or requires permission"
        return {'error': '题目不正常，或者是权限类题目!'}
    tp = [BASE_URL + img.attrs['src'] for img in td.find_all(name='img')]  # image URLs
    tm, inp, outp, inputexp, outputexp, tip = [], [], [], [], [], []
    source = None
    flag = 0  # index of the current section; -1 once the source heading is reached
    for tag in font.find_all(True):
        if tag.name in ('font', 'div', 'br'):
            continue
        elif tag.name == 'h3':
            # Each heading starts the next section; '【来源】' ("Source") is special
            if tag.string != '【来源】':
                flag += 1
            else:
                flag = -1
            continue
        if flag == 1:
            tm.append(fstr(tag.string))
        elif flag == 2:
            inp.append(fstr(tag.string))
        elif flag == 3:
            outp.append(fstr(tag.string))
        elif flag == 4:
            try:
                inputexp.extend(fstr(tag.string, '^$^$^$').split('^$^$^$'))
            except Exception:
                print('***Error reading problem details, ID %s' % str(pid))
        elif flag == 5:
            try:
                outputexp.extend(fstr(tag.string, '^$^$^$').split('^$^$^$'))
            except Exception:
                print('***Error reading problem details, ID %s' % str(pid))
        elif flag == 6:
            tip.append(fstr(tag.string))
        elif flag == -1:
            # Only the first tag after the source heading is inspected
            try:
                if tag.string.upper() != 'NO' and tag.string != '无':
                    source = (BASE_URL + tag.attrs['href'], tag.string)
                    print('!!!!!!!!!!%s,%s' % (source[0], source[1]))
            except (AttributeError, KeyError):
                pass  # no text or no href: leave source unset
            break
    # '无' means "none"; empty sections get it as a placeholder
    if not tm:
        tm = ['无']
    if not inp:
        inp = ['无']
    if not outp:
        outp = ['无']
    if not inputexp:
        inputexp = ['无']
    if not outputexp:
        outputexp = ['无']
    test = {'tm': tm, 'i': inp, 'o': outp, 'ie': inputexp, 'oe': outputexp}
    if tp:
        test['tp'] = tp
    if tip:
        test['t'] = tip
    if source:
        test['s'] = source
    return test
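
`analyze_testpage` relies on the order of the `h3` headings: each heading advances `flag`, and the tags that follow are filed under the current section. Below is a stripped-down sketch of that idea on plain `(tag_name, text)` pairs, with no HTML parsing involved. The name `split_sections` and the sample data are made up for illustration; the keys mirror those of the returned dict.

```python
def split_sections(tags, keys=('tm', 'i', 'o', 'ie', 'oe', 't')):
    """Group text by the h3 heading that precedes it, in heading order."""
    sections = {key: [] for key in keys}
    index = -1  # no heading seen yet
    for name, text in tags:
        if name == 'h3':
            index += 1  # a heading opens the next section
            continue
        if 0 <= index < len(keys):
            sections[keys[index]].append(text)
    return sections

tags = [('h3', 'Description'), ('p', 'Add two numbers.'),
        ('h3', 'Input'), ('p', 'Two integers.'),
        ('h3', 'Output'), ('p', 'Their sum.')]
print(split_sections(tags)['i'])  # ['Two integers.']
```

A position-based approach like this is simple but depends on every page listing its sections in the same order; matching on the heading text instead would be a more tolerant alternative.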