Scraping Reddit Saved Posts - 油猴中文网 - Powered by Discuz! Archiver

朱焱伟 posted on 2024-6-8 04:31:23


Last edited by 朱焱伟 on 2024-8-11 18:06

# Scraping Reddit Saved Posts

## Goal

How can I save every post I have saved or upvoted on Reddit?

## Using saveddit

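After comparing several tools, I found [saveddit](https://github.com/p-ranav/saveddit) to work best. The rest of this section walks through using `saveddit` to scrape your Reddit saved posts.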

### Create a Python 3 environment

```bash
conda create -n py310 python=3.10.6
conda activate py310
conda update -n base -c defaults conda
```

### Install saveddit

Run `conda install saveddit` or `pip3 install saveddit`.
See the [saveddit](https://github.com/p-ranav/saveddit) README for details.

### Create a Reddit script app

Create an app of the "script" type at [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps), so that its app ID and secret can be used below to call the Reddit API.

### Fill in the credentials

Once saveddit is installed, a YAML configuration file should already exist. Open it for editing:

```bash
vim ~/.saveddit/user_config.yaml
```

In the configuration file, replace the values with your own Reddit API credentials, taken from the app you created at [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps).

```
reddit_client_id: 'your_script_app_id'
reddit_client_secret: 'your_script_app_secret'
reddit_username: 'your_reddit_username'
```
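
Before running saveddit, it can help to confirm that none of the three values was left empty. The following is a minimal sketch, not part of saveddit: the helper names `read_flat_config` and `missing_keys` are made up for illustration, and the parser only handles the flat `key: 'value'` lines shown above rather than general YAML.

```python
from pathlib import Path

# The three keys shown in the config excerpt above.
REQUIRED_KEYS = ("reddit_client_id", "reddit_client_secret", "reddit_username")


def read_flat_config(text):
    """Parse simple `key: 'value'` lines; this is not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Drop surrounding whitespace and quotes from the value.
        config[key.strip()] = value.strip().strip("'\"")
    return config


def missing_keys(config):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


if __name__ == "__main__":
    config_path = Path.home() / ".saveddit" / "user_config.yaml"
    if config_path.exists():
        config = read_flat_config(config_path.read_text(encoding="utf-8"))
        print("Missing or empty keys:", missing_keys(config))
    else:
        print(f"{config_path} not found")
```

An empty list means all three values are present; it does not prove that Reddit accepts them.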

### Test run

At this point saveddit should run. As a test, scrape the first few posts you have upvoted.

```bash
saveddit user users your_reddit_username upvoted -l 5
```

### Hotlink-protected images

```
* Processing `https://i.redd.it/6h97arsgih171.jpg`
* This is a direct link to a jpeg file
- <urlopen error EOF occurred in violation of protocol (_ssl.c:997)>
* Saving submission.json
* Saving top-level comments to comments.json
- 100%|████████████████████| 8/8
* Saved to output/www.reddit.com/u/XXX/upvoted/001_XXX
```

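The run above does download the content of some posts (including images and comments), but some image links that start with `i.redd.it` fail to download, for example [https://i.redd.it/6h97arsgih171.jpg](https://i.redd.it/6h97arsgih171.jpg).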

The line `urllib.request.urlretrieve(submission.url, output_path)` raises `<urlopen error EOF occurred in violation of protocol (_ssl.c:997)>`.

To find the cause, look at the saveddit source. First locate where saveddit is installed:

```bash
conda activate py310
whereis saveddit
pip3 show saveddit
```

The output above shows where saveddit is installed. From there, locate the source file to modify, for example:

```bash
vim ~/anaconda3/envs/py310/lib/python3.10/site-packages/saveddit/submission_downloader.py
```
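
The same file can also be located from Python itself, which avoids guessing the environment path. This is a small sketch using the standard `importlib` machinery; the helper name `locate_module` is made up for illustration, and it has to be run inside the `py310` environment to find saveddit.

```python
import importlib.util


def locate_module(module_name):
    """Return the source file of an importable module, or None if it cannot be found."""
    try:
        # For a dotted name, find_spec imports the parent package first,
        # which raises ModuleNotFoundError if that package is not installed.
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec else None


# Prints the path to submission_downloader.py, or None if saveddit is not installed here.
print(locate_module("saveddit.submission_downloader"))
```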

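The call `urllib.request.urlretrieve(submission.url, output_path)` is what fails with an SSL-related error on image links that start with `i.redd.it`. I could not work out how to change the arguments passed to `urlretrieve` in urllib. However, curl downloads the image without problems, so I replaced `urllib.request.urlretrieve` with a call to the curl command. Edit `saveddit/submission_downloader.py`: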

```python
# Original implementation, kept for reference:
# def download_direct_link(self, submission, output_path):
#     try:
#         urllib.request.urlretrieve(submission.url, output_path)
#     except Exception as e:
#         self.print_formatted_error(e)

def curl_download(self, url, output_path):
    # Imported locally so the patch does not depend on the file's existing imports.
    import subprocess
    curl_command = ['curl', '-o', output_path, url]
    # check=True raises CalledProcessError when curl exits with a non-zero status,
    # so the except clause in download_direct_link can actually report the failure.
    subprocess.run(curl_command, check=True)

def download_direct_link(self, submission, output_path):
    try:
        self.curl_download(submission.url, output_path)
    except Exception as e:
        self.print_formatted_error(e)
```

I tried it, and downloads now work. It runs, so I will keep making do with it.
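
For anyone who would rather stay inside urllib than shell out to curl, one option is to build the request by hand so that headers such as `User-Agent` can be set. The sketch below is only that: the function name `download_with_headers` and the default header value are assumptions, and I have not verified that it avoids the `_ssl.c:997` error on `i.redd.it` links.

```python
import shutil
import urllib.request


def download_with_headers(url, output_path, user_agent="Mozilla/5.0", timeout=30):
    """Download url to output_path with an explicit User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Stream the response body to disk instead of holding it all in memory.
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(output_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return output_path
```

Inside `download_direct_link`, the call would take the place of `urllib.request.urlretrieve(submission.url, output_path)`.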

### [gallery-dl](https://github.com/mikf/gallery-dl)

For some links that start with `https://i.redd.it/`, a plain curl call may still fail. In that case gallery-dl can be used instead.

```bash
python3 -m pip install -U gallery-dl
whereis gallery-dl
gallery-dl https://i.redd.it/zbthk3rrb57d1.jpeg
gallery-dl -D . https://i.redd.it/zbthk3rrb57d1.jpeg
```

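```python
import os
import subprocess

def curl_download(self, url, output_path):
    home_directory = os.path.expanduser('~')
    gallery_dl_path = os.path.join(home_directory, 'anaconda3/envs/py310/bin/gallery-dl')
    # Keep everything up to and including the "files" folder.
    # The "[0]" index was lost in the original post and is restored here.
    output_folder = output_path.split('/files')[0] + '/files'
    # The command list was also lost in the original post; this reconstruction
    # assumes the "-D" (exact directory) form used in the shell example above.
    curl_command = [gallery_dl_path, '-D', output_folder, url]
    result = subprocess.run(curl_command, capture_output=True, text=True)
    if result.returncode == 0:
        print("Download successful!")
    else:
        print(f"Error: {result.stderr}")
```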
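
The hard-coded `anaconda3/envs/py310` path only works on a machine laid out like mine. As a hedged variation, the sketch below looks gallery-dl up on `PATH` with `shutil.which` and builds the `-D` command separately so it can be inspected before it is run. The function name `build_gallery_dl_command` and the example output path are made up for illustration.

```python
import shutil


def build_gallery_dl_command(url, output_path, executable=None):
    """Build the argument list for `gallery-dl -D <folder> <url>`."""
    # Prefer an explicit path, then whatever is on PATH, then the bare name.
    executable = executable or shutil.which("gallery-dl") or "gallery-dl"
    # Everything up to and including the first "/files" segment of the path.
    output_folder = output_path.split("/files")[0] + "/files"
    return [executable, "-D", output_folder, url]


# Hypothetical output path, modelled on the log excerpt earlier in the post.
example = build_gallery_dl_command(
    "https://i.redd.it/zbthk3rrb57d1.jpeg",
    "output/www.reddit.com/u/XXX/upvoted/001_XXX/files/zbthk3rrb57d1.jpeg",
)
print(example[1:])
```

Note that if the path contains no `/files` segment, the whole path is kept and `/files` is appended to it, which may not be what you want.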

- [gallery-dl command-line options](https://github.com/mikf/gallery-dl/blob/master/docs/options.md)

```
-d, --destination PATH    Target location for file downloads
-D, --directory PATH    Exact location for file downloads
```

gallery-dl itself supports quite a few sites; see the [supported sites list](https://github.com/mikf/gallery-dl/blob/master/docs/supportedsites.md) if you feel like exploring further.

### Save the "study materials"

Back to the original goal: scrape your own saved and upvoted posts. For other usage, see the official [saveddit](https://github.com/p-ranav/saveddit) README.

```bash
conda activate py310
saveddit user users your_reddit_username saved -l 1000000 -o output 2>&1 | tee log.saved.txt
saveddit user users your_reddit_username upvoted -l 1000000 -o output 2>&1 | tee log.upvoted.txt
```
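
With a long run, the `tee` logs are the easiest place to see how many downloads failed. The sketch below counts processed links and collects `urlopen error` lines. It assumes the log format matches the excerpt shown earlier (`* Processing ...` and `- <urlopen error ...>`); the function name `summarize_log` is made up for illustration, and failures reported in any other form are not counted.

```python
from pathlib import Path


def summarize_log(text):
    """Count '* Processing' lines and collect lines that mention 'urlopen error'."""
    processed = 0
    errors = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("* Processing"):
            processed += 1
        elif "urlopen error" in stripped:
            errors.append(stripped)
    return processed, errors


if __name__ == "__main__":
    for name in ("log.saved.txt", "log.upvoted.txt"):
        log_path = Path(name)
        if log_path.exists():
            count, failures = summarize_log(log_path.read_text(encoding="utf-8", errors="replace"))
            print(f"{name}: {count} links processed, {len(failures)} urlopen errors")
```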

## Related links

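- [saveddit](https://github.com/p-ranav/saveddit): Bulk Downloader for Reddit
- [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps): choose the "script" type
- [load-reddit-images-directly](https://github.com/nopperl/load-reddit-images-directly)
- [gallery-dl](https://github.com/mikf/gallery-dl)
- [DownloaderForReddit](https://github.com/MalloyDelacroix/DownloaderForReddit)
- [bulk-downloader-for-reddit](https://github.com/aliparlakci/bulk-downloader-for-reddit): Downloads and archives content from reddit
- [bdfr-html](https://github.com/BlipRanger/bdfr-html): Converts the output of the bulk downloader for reddit to a set of HTML pages.
- [https://www.cnblogs.com/czx1/p/11442442.html](https://www.cnblogs.com/czx1/p/11442442.html)
- [https://www.reddit.com/r/DataHoarder/comments/xdymfa/has_anyone_used_bulk_downloader_for_reddit/](https://www.reddit.com/r/DataHoarder/comments/xdymfa/has_anyone_used_bulk_downloader_for_reddit/)
- [https://www.reddit.com/r/DataHoarder/comments/xd4zwx/downloading_all_media_in_saved_on_reddit/](https://www.reddit.com/r/DataHoarder/comments/xd4zwx/downloading_all_media_in_saved_on_reddit/)
- [https://www.reddit.com/r/Python/comments/v5e8lu/the_ultimate_reddit_media_downloader/](https://www.reddit.com/r/Python/comments/v5e8lu/the_ultimate_reddit_media_downloader/)
- [RedDownloader](https://github.com/Jackhammer9/RedDownloader)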

## Loading i.redd.it images directly

Suppose I open a link such as `https://i.redd.it/6h97arsgih171.jpg`. Because of site restrictions, the image cannot simply be saved with Ctrl+S.

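A search turned up a Firefox extension made for exactly this, [Load Reddit Images Directly](https://addons.mozilla.org/en-US/firefox/addon/load-reddit-images-directly/); its source is at [load-reddit-images-directly](https://github.com/nopperl/load-reddit-images-directly). The code is worth a look, but the extension itself is not pleasant to use.
Someone opened [an issue](https://github.com/nopperl/load-reddit-images-directly/issues/5) asking whether the extension could be turned into a userscript to make it easier to use; the author does not seem to plan on it.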

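I found another userscript that works well (better than the Firefox extension above, and able to replace it): [Handy Image](https://sleazyfork.org/zh-CN/scripts/109-handy-image). Its source is at [HandyImage](https://github.com/Owyn/HandyImage).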

## Other notes

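There are also some related discussions on Reddit. I tried and compared the other tools, and none of them seemed as easy to use as saveddit. As for why I started using Reddit at all: two of my Twitter accounts have already been banned. I searched high and low, only to be sent back to square one overnight. The pass ahead may be hard as iron, but today I set out again from the start. I hope I do not lose my saved posts this time. Right now all I want is to cram your little hard drive full of my big data.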

王一之 posted on 2024-6-8 10:14:09

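Bro, could you send me the content so I can test it? (The email address is at the bottom of the page.) It may have been a false positive.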

朱焱伟 posted on 2024-6-8 11:46:37

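王一之 posted on 2024-6-8 10:14
Bro, could you send me the content so I can test it? (The email address is at the bottom of the page.) It may have been a false positive.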

OK.

王一之 posted on 2024-6-8 23:01:00

It's been handled, bro.
