Crawl the QQ email addresses posted in the Tieba thread http://tieba.baidu.com/p/3600458679

Published: 2025-02-19 23:36:56

Recommended reference answer (provided by an official 快搜搜题库 instructor)

Answer:

import re
from itertools import chain
from urllib.request import urlopen

# Thread named in the question; every request below uses this URL.
BASE_URL = "http://tieba.baidu.com/p/3600458679"

def getPageHtml(url):
    # Fetch the page source and decode it as UTF-8.
    with urlopen(url) as obj:
        return obj.read().decode('utf-8')

# print(getPageHtml(BASE_URL))
# The page count appears in markup such as:
# <li class="l_reply_num" style="margin-left:8px" >1193回复贴，共26页</li>

def getPagenum(text):
    # Extract the total number of pages from the page source.
    # Any tag wrapped directly around the number is skipped.
    pattern = r'共(?:<[^>]*>)?(\d{1,3})(?:<[^>]*>)?页'
    return re.findall(pattern, text)[0]

# text = getPageHtml(BASE_URL)
# print(getPagenum(text))
# Later pages are addressed as http://tieba.baidu.com/p/3600458679?pn=2

def getPageEMail(count):
    # Crawl every page in turn and match QQ addresses with a regular expression.
    mails = []
    for i in range(int(count)):
        url = BASE_URL + "?pn=%d" % (i + 1)
        text = getPageHtml(url)
        # <li class="d_name" data-field='{"user_id":1159023837}'>
        # Example of a target address: 811393332@qq.com
        pattern1 = r'\d{5,12}@qq\.com'
        print("Crawling %s" % url)
        found = re.findall(pattern1, text)
        print(found)
        mails.append(found)
    return mails

def main():
    text = getPageHtml(BASE_URL)
    count = getPagenum(text)
    email = getPageEMail(count)
    # chain joins the per-page lists into one sequence of addresses.
    with open("mails.txt", 'w') as f:
        for i in chain(*email):
            f.write(i + "\n")

main()
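The matching and flattening steps in the answer can be checked without touching the network. The following is a minimal sketch, not part of the original answer: the helper name `extract_mails` and the two strings in `sample_pages` are hypothetical, invented only for illustration. It reuses the answer's pattern `\d{5,12}@qq\.com` and additionally removes duplicates while keeping first-seen order, which the answer itself does not do.

```python
import re
from itertools import chain

# Same pattern as in the answer: 5 to 12 digits followed by @qq.com.
MAIL_PATTERN = re.compile(r'\d{5,12}@qq\.com')

def extract_mails(pages):
    """Return unique QQ addresses from a list of page texts, in first-seen order.

    Hypothetical helper for illustration only.
    """
    # One list of matches per page, as getPageEMail builds in the answer.
    per_page = [MAIL_PATTERN.findall(text) for text in pages]
    # Flatten the per-page lists into a single stream of addresses.
    flat = chain.from_iterable(per_page)
    # dict.fromkeys drops duplicates and preserves insertion order.
    return list(dict.fromkeys(flat))

# Made-up page bodies standing in for downloaded HTML.
sample_pages = [
    "contact 811393332@qq.com or 12345@qq.com",
    "again 811393332@qq.com, also 9876543210@qq.com and abc@qq.com",
]

if __name__ == "__main__":
    for mail in extract_mails(sample_pages):
        print(mail)
```

Addresses that repeat across pages are common in threads like this one, so deduplicating before writing `mails.txt` may be worth considering; whether that is wanted depends on how the output is used.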
