wechatarticles package

Submodules

wechatarticles.AccountBiz module

class wechatarticles.AccountBiz.AccountBiz(cookie, token=None, method=None, t=120, proxies={'http': None, 'https': None})

Base class: object

Looks up an official account's biz from the account's name.

Supported sources: the WeChat official account web platform, Qingbo, and Xigua ['xigu', 'qingbo', 'office']

In testing, Xigua returned the most results in a single request (the Xigua source is now deprecated).

office(s, nickname_lst)

qingbo(nickname_lst)

run(nickname_lst)

xigua(nickname_lst)

wechatarticles.ArticlesAPI module

class wechatarticles.ArticlesAPI.ArticlesAPI(username=None, password=None, official_cookie=None, token=None, appmsg_token=None, wechat_cookie=None, outfile=None)

Base class: object

Combines PublicAccountsWeb and ArticlesInfo. Use this API with caution: it is no longer maintained.

complete_info(nickname, begin=0, count=5)

Fetches information about an official account's articles.

nickname: str: name of the official account
begin: str or int: page to start crawling from
count: str or int: number of articles to fetch per request, 1-5

list:

An array with one entry per article:

[
    {
        'aid': '2650949647_1',
        'appmsgid': 2650949647,
        'comments':  # article comment information
            {
                "base_resp": {
                    "errmsg": "ok",
                    "ret": 0
                },
                "elected_comment": [
                    {
                        "content": text of the comment,
                        "content_id": "6846263421277569047",
                        "create_time": 1520098511,
                        "id": 3,
                        "is_from_friend": 0,
                        "is_from_me": 0,
                        "is_top": 0,  # whether the comment is pinned
                        "like_id": 10001,
                        "like_num": 3,
                        "like_status": 0,
                        "logo_url": "http://wx.qlogo.cn/mmhead/OibRNdtlJdkFLMHYLMR92Lvq0PicDpJpbnaicP3Z6kVcCicLPVjCWbAA9w/132",
                        "my_id": 23,
                        "nick_name": name of the commenting user,
                        "reply": {
                            "reply_list": [ ]
                        }
                    }
                ],
                "elected_comment_total_cnt": 3,  # total number of comments
                "enabled": 1,
                "friend_comment": [ ],
                "is_fans": 1,
                "logo_url": "http://wx.qlogo.cn/mmhead/Q3auHgzwzM6GAic0FAHOu9Gtv5lEu5kUqO6y6EjEFjAhuhUNIS7Y2AQ/132",
                "my_comment": [ ],
                "nick_name": name of the current user,
                "only_fans_can_comment": false
            },
        'cover': cover image URL,
        'digest': article digest,
        'itemidx': 1,
        'like_num': 18,  # number of likes on the article
        'link': article URL,
        'read_num': 610,  # number of reads of the article
        'title': article title,
        'update_time': timestamp of the last update to the article
    },
]

If the list is empty, no matching articles were found.
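As a rough illustration of consuming the list described above, the sketch below ranks article records by read count. The sample records are made up and only mirror a few of the documented keys, and `top_articles` is a hypothetical helper, not part of wechatarticles.

```python
from typing import Any, Dict, List

# Made-up records that mirror the documented keys; real values would come
# from complete_info().
sample_records: List[Dict[str, Any]] = [
    {"aid": "2650949647_1", "title": "First post", "read_num": 610, "like_num": 18},
    {"aid": "2650949648_1", "title": "Second post", "read_num": 1200, "like_num": 7},
]


def top_articles(records: List[Dict[str, Any]], limit: int = 5) -> List[Dict[str, Any]]:
    """Return up to `limit` records sorted by read_num, highest first."""
    # An empty list means no matching articles, so there is nothing to rank.
    if not records:
        return []
    ranked = sorted(records, key=lambda r: r.get("read_num", 0), reverse=True)
    return ranked[:limit]


for record in top_articles(sample_records):
    print(record["title"], record["read_num"], record["like_num"])
```

Using `.get("read_num", 0)` keeps the helper usable on records that lack the read count, such as those from an API that returns only link metadata.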

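continue_info(nickname, begin=0)

Automatically fetches information about an official account's articles, continuing until a request fails.

nickname: str: name of the official account
begin: str or int: page to start crawling from

list:

An array with one entry per article: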

[
    {
        'aid': '2650949647_1',
        'appmsgid': 2650949647,
        'comments':  # article comment information
            {
                "base_resp": {
                    "errmsg": "ok",
                    "ret": 0
                },
                "elected_comment": [
                    {
                        "content": text of the comment,
                        "content_id": "6846263421277569047",
                        "create_time": 1520098511,
                        "id": 3,
                        "is_from_friend": 0,
                        "is_from_me": 0,
                        "is_top": 0,  # whether the comment is pinned
                        "like_id": 10001,
                        "like_num": 3,
                        "like_status": 0,
                        "logo_url": "http://wx.qlogo.cn/mmhead/OibRNdtlJdkFLMHYLMR92Lvq0PicDpJpbnaicP3Z6kVcCicLPVjCWbAA9w/132",
                        "my_id": 23,
                        "nick_name": name of the commenting user,
                        "reply": {
                            "reply_list": [ ]
                        }
                    }
                ],
                "elected_comment_total_cnt": 3,  # total number of comments
                "enabled": 1,
                "friend_comment": [ ],
                "is_fans": 1,
                "logo_url": "http://wx.qlogo.cn/mmhead/Q3auHgzwzM6GAic0FAHOu9Gtv5lEu5kUqO6y6EjEFjAhuhUNIS7Y2AQ/132",
                "my_comment": [ ],
                "nick_name": name of the current user,
                "only_fans_can_comment": false
            },
        'cover': cover image URL,
        'digest': article digest,
        'itemidx': 1,
        'like_num': 18,  # number of likes on the article
        'link': article URL,
        'read_num': 610,  # number of reads of the article
        'title': article title,
        'update_time': timestamp of the last update to the article
    },
]

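If the list is empty, no matching articles were found.

wechatarticles.ArticlesInfo module

class wechatarticles.ArticlesInfo.ArticlesInfo(appmsg_token, cookie, proxies={'http': None, 'https': None})

Base class: object

Logs in to WeChat to retrieve more detailed information about a post, such as its like count, read count, and comments.

comments(article_url)

Fetches the comments on an article.

article_url: str: link to the article

json: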

{
    "base_resp": {
        "errmsg": "ok",
        "ret": 0
    },
    "elected_comment": [
        {
            "content": text of the comment,
            "content_id": "6846263421277569047",
            "create_time": 1520098511,
            "id": 3,
            "is_from_friend": 0,
            "is_from_me": 0,
            "is_top": 0,  # whether the comment is pinned
            "like_id": 10001,
            "like_num": 3,
            "like_status": 0,
            "logo_url": "http://wx.qlogo.cn/mmhead/OibRNdtlJdkFLMHYLMR92Lvq0PicDpJpbnaicP3Z6kVcCicLPVjCWbAA9w/132",
            "my_id": 23,
            "nick_name": name of the commenting user,
            "reply": {
                "reply_list": [ ]
            }
        }
    ],
    "elected_comment_total_cnt": 3,  # total number of comments
    "enabled": 1,
    "friend_comment": [ ],
    "is_fans": 1,
    "logo_url": "http://wx.qlogo.cn/mmhead/Q3auHgzwzM6GAic0FAHOu9Gtv5lEu5kUqO6y6EjEFjAhuhUNIS7Y2AQ/132",
    "my_comment": [ ],
    "nick_name": name of the current user,
    "only_fans_can_comment": false
}
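The comments payload above can be condensed into a few headline numbers. The following is a minimal sketch under two assumptions that the reference does not state explicitly: that `ret == 0` signals success and that `is_top == 1` marks a pinned comment. `summarize_comments` and the sample payload are hypothetical.

```python
from typing import Any, Dict, List


def summarize_comments(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a comments payload into a few headline numbers."""
    comments: List[Dict[str, Any]] = payload.get("elected_comment", [])
    return {
        # Assumption: ret == 0 means the request succeeded.
        "ok": payload.get("base_resp", {}).get("ret") == 0,
        "total": payload.get("elected_comment_total_cnt", len(comments)),
        # Assumption: is_top == 1 marks a pinned comment.
        "pinned": [c["content"] for c in comments if c.get("is_top") == 1],
        "likes": sum(c.get("like_num", 0) for c in comments),
    }


# Made-up payload that follows the documented shape.
sample = {
    "base_resp": {"errmsg": "ok", "ret": 0},
    "elected_comment": [
        {"content": "Nice article", "is_top": 1, "like_num": 3},
        {"content": "Thanks", "is_top": 0, "like_num": 1},
    ],
    "elected_comment_total_cnt": 2,
}
print(summarize_comments(sample))
```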

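content(url)

read_like_nums(article_url)

Fetches the read count and like count.

article_url: str: link to the article

(int, int): read count, like count

wechatarticles.ArticlesUrls module

class wechatarticles.ArticlesUrls.Mobile(biz, cookie)

Base class: object

Uses the mobile WeChat client to obtain the links to the posts of the official account being crawled.

get_urls(appmsg_token, offset='0')

appmsg_token: str: token obtained after logging in with a personal WeChat account
offset: str or int: starting position, beginning at 0 and increasing by 10 each time (a step larger than 10 is possible, but the right value is hard to determine, so step by 10 and de-duplicate afterwards)

list: an array with one entry per article: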

[
    {
        'app_msg_ext_info': {
            'audio_fileid': 0,
            'author': '',
            'content': '',
            'content_url': article URL (contains escaped slashes '\/' whose backslashes must be removed),
            'copyright_stat': 100,
            'cover': article cover image URL (contains escaped slashes '\/' whose backslashes must be removed),
            'del_flag': 1,
            'digest': '',
            'duration': 0,
            'fileid': 0,
            'is_multi': 0,
            'item_show_type': 8,
            'malicious_content_type': 0,
            'malicious_title_reason_id': 0,
            'multi_app_msg_item_list': [],
            'play_url': '',
            'source_url': '',
            'subtype': 9,
            'title': article title
        },
        'comm_msg_info': {
            'content': '',
            'datetime': 1536930840,
            'fakeid': '2394588245',
            'id': 1000000262,
            'status': 2,
            'type': 49
        }
    }
]
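The offset guidance above (step by 10, then de-duplicate) could be sketched as follows. `collect_items` and `fake_fetch` are hypothetical: `fake_fetch` stands in for a call such as `Mobile.get_urls(appmsg_token, offset)`, and using `comm_msg_info['id']` as the de-duplication key is an assumption.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


def collect_items(fetch_page: Callable[[int], List[Item]], max_pages: int = 3) -> List[Item]:
    """Call fetch_page with offsets 0, 10, 20, ... and drop duplicate entries."""
    seen = set()
    collected: List[Item] = []
    for page in range(max_pages):
        items = fetch_page(page * 10)
        if not items:
            break  # an empty page is treated as the end of the history
        for item in items:
            # Assumption: the message id identifies an entry uniquely.
            key = item["comm_msg_info"]["id"]
            if key not in seen:
                seen.add(key)
                collected.append(item)
    return collected


def fake_fetch(offset: int) -> List[Item]:
    """Stand-in for a real get_urls call; returns overlapping pages."""
    pages = {
        0: [{"comm_msg_info": {"id": 1}}, {"comm_msg_info": {"id": 2}}],
        10: [{"comm_msg_info": {"id": 2}}, {"comm_msg_info": {"id": 3}}],
    }
    return pages.get(offset, [])


print([item["comm_msg_info"]["id"] for item in collect_items(fake_fetch)])
```

Passing the fetch function in as a parameter keeps the paging logic independent of how the request is made, so the same loop could wrap either the mobile or the PC variant.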

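class wechatarticles.ArticlesUrls.PC(biz, uin, cookie, proxies={'http': None, 'https': None})

Base class: object

Uses the PC WeChat client to obtain the links to the posts of the official account being crawled.

get_urls(key, offset='0')

key: str: key obtained after logging in with a personal WeChat account
offset: str or int: starting position, beginning at 0 and increasing by 10 each time (a step larger than 10 is possible, but the right value is hard to determine, so step by 10 and de-duplicate afterwards)

list: an array with one entry per article. The fields of main interest are item['app_msg_ext_info']['content_url'], item['app_msg_ext_info']['title'], and item['comm_msg_info']['datetime']:

import html
To remove the escaping: html.unescape(html.unescape(url)); eval(repr(url).replace('\\', ''))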
[
    {
        'app_msg_ext_info': {
            'audio_fileid': 0,
            'author': '',
            'content': '',
            'content_url': article URL (contains escaped slashes '\/' whose backslashes must be removed),
            'copyright_stat': 100,
            'cover': article cover image URL (contains escaped slashes '\/' whose backslashes must be removed),
            'del_flag': 1,
            'digest': '',
            'duration': 0,
            'fileid': 0,
            'is_multi': 0,
            'item_show_type': 8,
            'malicious_content_type': 0,
            'malicious_title_reason_id': 0,
            'multi_app_msg_item_list': [],
            'play_url': '',
            'source_url': '',
            'subtype': 9,
            'title': article title
        },
        'comm_msg_info': {
            'content': '',
            'datetime': 1536930840,
            'fakeid': '2394588245',
            'id': 1000000262,
            'status': 2,
            'type': 49
        }
    }
]
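One way to clean a returned `content_url` without `eval` is sketched below. It assumes the raw value carries backslash-escaped slashes and doubly encoded HTML entities, as the note before the example suggests; `clean_url` and the sample value are hypothetical.

```python
import html


def clean_url(raw_url: str) -> str:
    """Drop backslash escapes, then decode HTML entities twice."""
    without_backslashes = raw_url.replace("\\", "")
    # Two passes turn "&amp;amp;" into "&amp;" and then into "&".
    return html.unescape(html.unescape(without_backslashes))


# Made-up example value, not a real article link.
raw = "http:\\/\\/mp.weixin.qq.com\\/s?__biz=XXX&amp;amp;mid=1&amp;amp;idx=1"
print(clean_url(raw))
```

A plain `str.replace` removes the backslashes directly, which avoids evaluating text that came from a remote response.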

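class wechatarticles.ArticlesUrls.PublicAccountsWeb(cookie, token, proxies={'http': None, 'https': None})

Base class: object

Crawls article links or official account information through the WeChat official account web platform.

articles_nums(nickname)

Returns the total number of articles the official account has published.

nickname: str: name of the official account to crawl

int: total number of articles

get_urls(nickname, begin=0, count=5)

Fetches one page of article information for the official account.

nickname: str: name of the official account to crawl
begin: str or int: page to start crawling from
count: str or int: number of articles to fetch per request, 1-5

list:

An array with one entry per article: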

[
    {
        'aid': '2650949647_1',
        'appmsgid': 2650949647,
        'cover': cover image URL,
        'digest': article digest,
        'itemidx': 1,
        'link': article URL,
        'title': article title,
        'update_time': timestamp of the last update to the article
    },
]

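If the list is empty, no matching articles were found.

latest_articles(biz)

Fetches the article information on the official account's most recent page.

biz: str: biz of the official account

list:

An array with one entry per article: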

[
    {
        'aid': '2650949647_1',
        'appmsgid': 2650949647,
        'cover': cover image URL,
        'digest': article digest,
        'itemidx': 1,
        'link': article URL,
        'title': article title,
        'update_time': timestamp of the last update to the article
    },
]

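If the list is empty, no matching articles were found.

official_info(nickname, begin=0, count=5)

Returns information about official accounts that match a keyword.

nickname: str: name of the official account to crawl
begin: str or int: page to start crawling from
count: str or int: number of results to fetch per request, 1-5

list: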

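wechatarticles.Url2Html module

class wechatarticles.Url2Html.Url2Html(img_path=None)

Base class: object

Downloads a WeChat article as a local HTML file, given its link.

article_info(html)

Extracts the official account name and the author from the supplied HTML source.

html: str: HTML source of the article

(str, str): official account name and author name

download_img(url)

url: str: image link

str: local path of the downloaded image

download_media(html, title)

get_timestamp(html)

Extracts the article's publication timestamp from the supplied HTML source.

html: str: HTML source of the article

int: publication timestamp of the article

get_title(html)

Extracts the article's title from the supplied HTML source.

html: str: HTML source of the article

str: article title extracted from the HTML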

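rename_title(title, html)

replace_img(html)

Finds the image links in the supplied HTML source and replaces them.

html: str: HTML source of the article

str: the HTML with online image links replaced by local image paths

replace_name(title)

Rewrites the title so that it complies with Windows file naming rules.

title: str: article title

str: the rewritten article title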
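To make the Windows naming rules concrete, here is a minimal sketch of one possible sanitizer. It is not the library's implementation of `replace_name`; the helper name `safe_title` and the choice of `_` as the replacement character are assumptions.

```python
import re

# Characters that Windows does not allow in file names: \ / : * ? " < > |
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_title(title: str, replacement: str = "_") -> str:
    """Replace forbidden characters and trim trailing dots and spaces."""
    cleaned = _FORBIDDEN.sub(replacement, title)
    # Windows also rejects names that end in a dot or a space.
    return cleaned.rstrip(". ")


print(safe_title('What is "AI"? A/B test: results'))
```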

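run(url, mode, proxies={'http': None, 'https': None}, **kwargs)

url: str: link to the WeChat article
mode: int: run mode
    1: return the HTML source without downloading images
    2: return the HTML source and download images, but do not replace image paths
    3: return the HTML source, download images, and replace image paths
    4: save the HTML source, download images, and replace image paths
    5: save the HTML source, download images, replace image paths, and download video and audio
kwargs:
    account: official account name
    title: article title
    date: date
    proxies: proxies
    img_path: image download path

str: the HTML source or a message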
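The five run modes differ only in which steps they enable, which the lookup table below spells out. This is a sketch transcribed from the mode descriptions, not code from the library; `ModeBehaviour`, `MODES`, and `describe` are hypothetical names, and raising `ValueError` for an unknown mode is an assumption about sensible behaviour rather than what `run` does.

```python
from typing import NamedTuple


class ModeBehaviour(NamedTuple):
    save_html: bool        # write the HTML to disk instead of only returning it
    download_images: bool  # fetch the images referenced by the article
    replace_paths: bool    # rewrite image links to local paths
    download_media: bool   # also fetch video and audio


# Transcribed from the mode descriptions for run().
MODES = {
    1: ModeBehaviour(False, False, False, False),
    2: ModeBehaviour(False, True, False, False),
    3: ModeBehaviour(False, True, True, False),
    4: ModeBehaviour(True, True, True, False),
    5: ModeBehaviour(True, True, True, True),
}


def describe(mode: int) -> ModeBehaviour:
    """Look up what a mode does; reject values outside 1-5."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}, got {mode!r}")
    return MODES[mode]


print(describe(4))
```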

timestamp2date(timestamp)

Converts a timestamp to a date.

timestamp: int: timestamp

str: publication date of the article, yyyy-mm-dd
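The conversion can be illustrated with the standard library alone. The sketch below evaluates the timestamp in UTC so that the result is reproducible; whether the library uses UTC or local time is not stated here, so treat that choice, and the helper name `to_date`, as assumptions.

```python
from datetime import datetime, timezone


def to_date(timestamp: int) -> str:
    """Format a Unix timestamp as yyyy-mm-dd, evaluated in UTC."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


print(to_date(1536930840))  # 2018-09-14
```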

wechatarticles.utils module

Helper functions for scripts

wechatarticles.utils.copyright_num(copyright_stat)

wechatarticles.utils.copyright_num_detailed(copyright_stat)

wechatarticles.utils.end_func(timestamp, end_timestamp)

wechatarticles.utils.flatten(x)

wechatarticles.utils.get_history_urls(biz, uin, key, lst=[], start_timestamp=0, count=10, endcount=99999)

wechatarticles.utils.read_nickname(fname)

wechatarticles.utils.remove_duplicate_json(fname)

wechatarticles.utils.save_f(fname)

wechatarticles.utils.save_json(fname, data)

Saves data in txt format.

fname: str: name of the txt file to save to
data: list: the crawled data

None

wechatarticles.utils.save_mongo(data, host=None, port=None, name=None, password='', dbname=None, collname=None)

Stores data in MongoDB.

data: list: data to insert
host: str: host name (defaults to the local database)
port: int: port opened on the MongoDB host, default 27017
name: str: user name
password: str: user password
dbname: str: name of the database to connect to
collname: str: name of the collection to insert into

None

wechatarticles.utils.swap_biz_id(biz=None, fakeid=None)

wechatarticles.utils.timestamp2date(timestamp)

Converts a timestamp to a date.

timestamp: int or str: timestamp to convert

datetime: the converted date, formatted as year-month-day hour:minute:second

wechatarticles.utils.transfer_url(url)

wechatarticles.utils.verify_url(article_url)
