Out of likes... Batch-fetching index counts from Baidu Webmaster Tools

*Posted 2014-6-28 23:19:01*

Last edited by GoGo闯 on 2014-6-28 23:20

Out of likes, off to class...
---------------------------------------------------------
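The site has many page types, so spot-checking indexing one type at a time takes too long, and the numbers Baidu returns for site: + inurl: queries are wildly off. So I decided to estimate indexing as 'Baidu index count / total pages'. About 150 URL rules have been added to the index count tool, though, and checking them one by one is tedious, so I put together a script. The total page count can be read from a database field or taken from the sitemap; if the sitemap is generated so that only pages with content are written to it, its size can be treated as the total number of valid pages.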
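
For the sitemap route, counting the entries could look roughly like the sketch below. It assumes a standard sitemaps.org `urlset` file; the function name and the inline sample are made up for illustration, and a real sitemap would be read from disk or fetched first.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def count_sitemap_urls(xml_text):
    """Return the number of <loc> entries in a sitemap document."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(SITEMAP_NS + "loc"))


# Hypothetical two-page sitemap, inlined so the sketch needs no network access.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.domain.com/abc/123.html</loc></url>
  <url><loc>http://www.domain.com/abc/122p1.html</loc></url>
</urlset>"""

print(count_sitemap_urls(SAMPLE_SITEMAP))
```

If the sitemap only lists pages that actually have content, this count is the denominator for the estimate.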

For some URL rules you have to subtract to get the real index count. For example, '/abc/*.html' matches both '/abc/123.html' and '/abc/122p1.html'.
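
The subtraction and the ratio can be sketched as below. The rule '/abc/*p*.html' and all the numbers are hypothetical, picked only to show the arithmetic; the function names are not from the script.

```python
def true_index_count(reported, overlaps=()):
    """Subtract the counts of narrower rules that a broad rule also matches."""
    return reported - sum(overlaps)


def coverage(indexed, total_pages):
    """Return indexed / total_pages, or 0.0 when the total is missing."""
    if total_pages <= 0:
        return 0.0
    return indexed / total_pages


# Hypothetical figures: the broad rule also counts the paginated pages.
broad = 1200   # reported for /abc/*.html
paged = 200    # reported for /abc/*p*.html (assumed rule)
detail_indexed = true_index_count(broad, [paged])
print(detail_indexed, coverage(detail_indexed, 2000))
```

With these made-up inputs, 1000 detail pages are indexed out of 2000, so the estimated coverage is 0.5.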

#coding:utf-8
# Baidu Webmaster Tools: fetch the index count for each URL rule
import pycurl
import StringIO
import re
import urllib

# Log in to Baidu Webmaster Tools, open the index count page, and paste the cookie here.
# pycurl expects each header as a 'Name: value' string.
head = ['Cookie: paste-your-cookie-here']
url_re = 'http://zhanzhang.baidu.com/indexs/index?site=http://www.domain.com' # replace domain with your own
# Captures (rule id, rule suffix) from the rule table on the index page
req = re.compile(r'ruleid="(.*?)"><span.*?rule_suffix">(.*?)</span></td>')
# Captures the "total" field from the list response
req_data = re.compile(r'{"ctime":".*?","total":"(.*?)","diff":.*?}')

def getHtml(url):
    # Fetch a URL with the cookie header attached and return the response body
    buf = StringIO.StringIO()
    crl = pycurl.Curl()
    crl.setopt(pycurl.FOLLOWLOCATION, 1)
    crl.setopt(pycurl.MAXREDIRS, 5)
    crl.setopt(pycurl.HTTPHEADER, head)
    crl.setopt(pycurl.URL, url)
    crl.setopt(pycurl.WRITEFUNCTION, buf.write)
    crl.perform()
    crl.close()
    return buf.getvalue()

html = getHtml(url_re)
rules = re.findall(req, html)
for rule_id, rule_suffix in rules:
    #post = urllib.urlencode({
    #'site':'http://www.domain.com/',
    #'id':'12027262406916504100',
    #'range':'month',
    #'page':'1',
    #'pagesize':'5'
    #})
    # replace domain with your own
    url = 'http://zhanzhang.baidu.com/indexs/list?site=http://www.domain.com/&id=%s&range=month&page=1&pagesize=5' % rule_id
    html_data = getHtml(url)
    totals = re.findall(req_data, html_data)
    # Print the rule and its first total; avoid an IndexError when nothing matched
    print rule_suffix, totals[0] if totals else 'no data'

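
The `req_data` step can be tried offline. The sketch below uses the same pattern as the script (with the braces escaped) against a made-up response fragment; the real response body may be shaped differently, and `first_total` is a name invented here.

```python
import re

# Same pattern as req_data in the script, with the literal braces escaped.
REQ_DATA = re.compile(r'\{"ctime":".*?","total":"(.*?)","diff":.*?\}')


def first_total(body):
    """Return the first "total" value in the body as an int, or None if absent."""
    match = REQ_DATA.search(body)
    if match is None:
        return None
    return int(match.group(1))


# Hypothetical response fragment; only the three field names come from the regex.
SAMPLE_BODY = (
    '[{"ctime":"2014-06-27","total":"1234","diff":"12"},'
    '{"ctime":"2014-06-26","total":"1222","diff":"-3"}]'
)

print(first_total(SAMPLE_BODY))
print(first_total("<html>login page</html>"))
```

Returning `None` instead of indexing into an empty list makes it easier to spot a rule whose request came back without data, for example when the cookie is no longer accepted.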

Also attached: a script for batch-adding URL rules to the Baidu index count tool. Adding 100+ URLs by hand is a real pain too...

#coding:utf-8
import urllib
import urllib2
#import cookielib
hosturl = 'http://zhanzhang.baidu.com/indexs/addrule?site=http://www.domain.com/' # replace domain
#cj = cookielib.LWPCookieJar()
#cookie_support = urllib2.HTTPCookieProcessor(cj)
#opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
#urllib2.install_opener(opener)
#h = urllib2.urlopen(hosturl)
headers = {
    # Add one URL rule by hand in the browser, then paste the cookie here
    "Cookie": "paste-your-cookie-here",
    "Host": "zhanzhang.baidu.com",
    "Origin": "http://zhanzhang.baidu.com",
    "Referer": "http://zhanzhang.baidu.com/indexs/index?site=http://www.domain.com/", # replace domain
    "X-Request-By": "baidu.ajax",
    "X-Requested-With": "XMLHttpRequest"
}

def postHtml(url, postData):
    # POST the url-encoded form data with the headers above and return the body
    request = urllib2.Request(url, postData, headers)
    response = urllib2.urlopen(request)
    text = response.read()
    return text

# Prepare a file named 'post_index.txt' with two tab-separated columns, matching the
# two fields you fill in when adding a URL in the Baidu index count tool. Example
# (the gap between the columns is a tab):
#* abc*p*.html
#* a*.html
#* b*c*/?*
for x in open('post_index.txt'):
    fields = x.strip().split('\t')
    if len(fields) != 2:
        continue  # skip blank or malformed lines
    prefix, suffix = fields
    postData = urllib.urlencode({
        "prefix": prefix,
        "suffix": suffix
    })
    html = postHtml(hosturl, postData)
    print html

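
Since the rule file is the only input, it can be worth checking it before any request is sent. A minimal sketch, assuming the two columns really are separated by a single tab; the function names and sample lines are illustrative only.

```python
def parse_rule_line(line):
    """Split one 'prefix<TAB>suffix' line; return None for blank or malformed lines."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return None
    return parts[0], parts[1]


def parse_rules(lines):
    """Return the (prefix, suffix) pairs from all well-formed lines."""
    rules = []
    for line in lines:
        rule = parse_rule_line(line)
        if rule is not None:
            rules.append(rule)
    return rules


# Sample lines as they would come from iterating over open('post_index.txt').
SAMPLE_LINES = ["*\tabc*p*.html\n", "\n", "*\ta*.html\n", "line without a tab\n"]
print(parse_rules(SAMPLE_LINES))
```

Lines that were saved with spaces instead of a tab are dropped here rather than posted as half-empty rules.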

*Posted 2014-6-30 14:31:20*

A question: after viewing the cookie, which part (name, domain...) should I copy into the code?

*Posted 2014-6-30 14:42:49*

SEO小橙 posted on 2014-6-30 14:31
A question: after viewing the cookie, which part (name, domain...) should I copy into the code? ...

"Cookie":" add a URL yourself, then put the cookie here",
To translate:

"Cookie":" http:www.xxx.com, cookie value",

*Posted 2014-6-30 14:53:43*

Just passing by; I don't understand any of this...

*Posted 2014-7-2 09:12:45*

行书 posted on 2014-6-30 06:42
"Cookie":" add a URL yourself, then put the cookie here",
To translate:

Is this cookie value the one shown under cookies in the browser's tools?

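*Posted 2014-7-2 09:39:00*

SEO小橙 posted on 2014-7-2 01:12
Is this cookie value the one shown under cookies in the browser's tools?

I think so.
Search Baidu first, and if it's still unclear, PM zero. I haven't looked into cookies, so I'll stay out of this.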

OP | *Posted 2014-7-2 09:59:15*

SEO小橙 posted on 2014-7-2 01:12
Is this cookie value the one shown under cookies in the browser's tools?

Yes..........................................
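
For anyone stuck on the same question: a browser's cookie viewer lists a name, value, domain, path and so on for each cookie, but an HTTP `Cookie` header only carries the `name=value` pairs, joined with `; `. A small sketch of building that string follows; the cookie names and values are placeholders, not the real names Baidu uses.

```python
def cookie_header(pairs):
    """Join (name, value) pairs into the value of a Cookie header."""
    return "; ".join("%s=%s" % (name, value) for name, value in pairs)


def pycurl_header_line(pairs):
    """Return the full 'Cookie: ...' line, the form pycurl's HTTPHEADER list takes."""
    return "Cookie: " + cookie_header(pairs)


# Placeholder names and values copied by hand from the browser's cookie viewer.
pairs = [("NAME_A", "value-a"), ("NAME_B", "value-b")]

print(cookie_header(pairs))       # goes into the urllib2 headers dict under "Cookie"
print(pycurl_header_line(pairs))  # goes into the pycurl head list
```

Copying the `Cookie` request header straight from the browser's network panel gives the same string without assembling it by hand.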

*Posted 2014-8-5 10:02:13*

Bookmarked.

*Posted 2014-8-5 18:30:45*

[Pseudo-original] A version for lazy people; 闯哥, please don't hold it against me.
[Honestly, I'd also like to earn a few points...]

#encoding=utf-8
import urllib2, urllib, re
import cookielib
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

URL_BAIDU_INDEX = u'http://www.baidu.com/'
URL_BAIDU_TOKEN = 'https://passport.baidu.com/v2/api/?getapi&tpl=pp&apiver=v3&class=login'
URL_BAIDU_LOGIN = 'https://passport.baidu.com/v2/api/?login'

# Set the username and password
username = 'your-account'
password = 'your-password'

# Set up cookies; the CookieJar manages them automatically, nothing to fill in by hand
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
reqReturn = urllib2.urlopen(URL_BAIDU_INDEX)

# Get the token
tokenReturn = urllib2.urlopen(URL_BAIDU_TOKEN)
matchVal = re.search(u'"token" : "(?P<tokenVal>.*?)"', tokenReturn.read())
tokenVal = matchVal.group('tokenVal')

# Build the login parameters. They were obtained by packet capture and correspond to
# the https://passport.baidu.com/v2/api/?login request
postData = {
    'username' : username,
    'password' : password,
    'u' : 'https://passport.baidu.com/',
    'tpl' : 'pp',
    'token' : tokenVal,
    'staticpage' : 'https://passport.baidu.com/static/passpc-account/html/v3Jump.html',
    'isPhone' : 'false',
    'charset' : 'UTF-8',
    'callback' : 'parent.bd__pcbs__ra48vi'
}
postData = urllib.urlencode(postData)

# Send the login request
loginRequest = urllib2.Request(URL_BAIDU_LOGIN, postData)
loginRequest.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
loginRequest.add_header('Accept-Encoding', 'gzip,deflate,sdch')
loginRequest.add_header('Accept-Language', 'zh-CN,zh;q=0.8')
loginRequest.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko')
loginRequest.add_header('Content-Type', 'application/x-www-form-urlencoded')
sendPost = urllib2.urlopen(loginRequest)

# Test the login
#houtai = 'http://zhanzhang.baidu.com/crawltools/index'
#content = urllib2.urlopen(houtai)
#print content.getcode() # test the login

# Submit one rule
def post(http, prefix, suffix):
    login_data = {
        "prefix": prefix,
        "suffix": suffix
    }
    print login_data
    url = 'http://zhanzhang.baidu.com/indexs/addrule?site=%s' % http
    headers = {
        'X-Request-By': 'baidu.ajax', # can also be omitted
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'http://zhanzhang.baidu.com/indexs/index?site=%s' % http,
    }
    data = urllib.urlencode(login_data)
    req = urllib2.Request(url, data, headers)
    try:
        reason = urllib2.urlopen(req)
        a = reason.read()
        reason.close()
        print a
    except Exception, e:
        print e

# Prepare a file named 'post_index.txt' with two tab-separated columns, matching the
# two fields you fill in when adding a URL in the Baidu index count tool. Example:
#* abc*p*.html
#* a*.html
#* b*c*/?*
def main(http):
    for x in open('c://1//post_index.txt', 'r'):
        fields = x.strip().split('\t')
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        prefix, suffix = fields
        post(http, prefix, suffix)

if __name__ == '__main__':
    main(http="http://www.domain.com/") # replace with your own site

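
The script above stops with an AttributeError if the token pattern finds nothing. Here is a small Python 3 sketch of the same two building blocks, a cookie-aware opener and the token lookup, with that case handled. The sample body is invented, the whitespace around the colon is matched loosely as an assumption, and nothing here contacts Baidu.

```python
import http.cookiejar
import re
import urllib.request


def build_session():
    """Return a CookieJar and an opener that stores and resends cookies with it."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener


# Same named group as the script; spacing around the colon is matched loosely.
TOKEN_RE = re.compile(r'"token"\s*:\s*"(?P<tokenVal>.*?)"')


def extract_token(body):
    """Return the token string from a response body, or None if it is missing."""
    match = TOKEN_RE.search(body)
    return match.group("tokenVal") if match else None


# Hypothetical response text, used only to exercise the pattern offline.
SAMPLE_BODY = 'bd__cbs({"data" : {"token" : "0123abcd"}})'

jar, opener = build_session()
print(len(jar))                    # no cookies until a request is made
print(extract_token(SAMPLE_BODY))
print(extract_token("no token here"))
```

Checking for `None` before building the login form gives a clear failure point when the token page changes.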
