feat: Browserbase Web Reader (#12877)
1 parent 0f8a6ef
commit 8f3b518
Showing 7 changed files with 155 additions and 0 deletions.
5 changes: 5 additions & 0 deletions
...ntegrations/readers/llama-index-readers-web/llama_index/readers/web/browserbase_web/BUILD
@@ -0,0 +1,5 @@
python_sources()

python_requirements(
    name="reqs",
)
47 changes: 47 additions & 0 deletions
...aders/llama-index-readers-web/llama_index/readers/web/browserbase_web/README.md
@@ -0,0 +1,47 @@
# Browserbase Web Reader

[Browserbase](https://browserbase.com) is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.

## Installation and Setup

- Get an API key from [browserbase.com](https://browserbase.com) and set it as the `BROWSERBASE_API_KEY` environment variable.
- Install the [Browserbase SDK](http://github.com/browserbase/python-sdk):

```
pip install browserbase
```
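
As a minimal sketch of the setup step above, the key can be resolved explicitly before the reader is constructed. `resolve_api_key` is a hypothetical helper introduced only for illustration; it is not part of the Browserbase SDK or LlamaIndex. It prefers an explicitly passed key and otherwise falls back to `BROWSERBASE_API_KEY`.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: prefer an explicit key, else read the environment."""
    key = explicit or env.get("BROWSERBASE_API_KEY")
    if not key:
        raise ValueError(
            "No API key found: set BROWSERBASE_API_KEY or pass a key explicitly"
        )
    return key


# Illustration with a stand-in mapping instead of the real environment.
example_key = resolve_api_key(env={"BROWSERBASE_API_KEY": "bb_example_key"})
```

The result could then be passed as `BrowserbaseWebReader(api_key=...)`, since the reader's constructor accepts an optional `api_key` argument. Failing early with a clear message is a design choice of this sketch, not documented SDK behavior.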

## Usage

### Loading documents

You can load webpages into LlamaIndex using `BrowserbaseWebReader`. Optionally, set the `text_content` parameter to `True` to convert the pages to a text-only representation.

```python
from llama_index.readers.web import BrowserbaseWebReader


reader = BrowserbaseWebReader()
docs = reader.load_data(
    urls=[
        "https://example.com",
    ],
    # Set to True to convert pages to a text-only representation
    text_content=False,
)
```

### Loading images

You can also load screenshots of webpages (as bytes) for multi-modal models.

```python
from browserbase import Browserbase
from base64 import b64encode

browser = Browserbase()
screenshot = browser.screenshot("https://browserbase.com")

# Optional. Convert to base64
img_encoded = b64encode(screenshot).decode()
```
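
Many multi-modal APIs accept images as base64 data URLs, so the encoded screenshot may need one more wrapping step. The sketch below shows that step using only the standard library. `to_data_url` is a hypothetical helper, and the `image/png` MIME type is an assumption about the screenshot format, not something stated above; the bytes used here are a stand-in rather than a real screenshot.

```python
from base64 import b64decode, b64encode


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Hypothetical helper: wrap raw image bytes in a base64 data URL."""
    encoded = b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in bytes (the PNG file signature), used instead of a real screenshot.
fake_screenshot = b"\x89PNG\r\n\x1a\n"
data_url = to_data_url(fake_screenshot)

# Round trip: the payload after the comma decodes back to the original bytes.
payload = data_url.split(",", 1)[1]
assert b64decode(payload) == fake_screenshot
```

With a real screenshot, the bytes returned by `browser.screenshot(...)` would take the place of `fake_screenshot`.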
Empty file.
48 changes: 48 additions & 0 deletions
...egrations/readers/llama-index-readers-web/llama_index/readers/web/browserbase_web/base.py
@@ -0,0 +1,48 @@
import logging
from typing import Optional, Iterator, Sequence
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


logger = logging.getLogger(__name__)


class BrowserbaseWebReader(BaseReader):
    """BrowserbaseWebReader.
    Load pre-rendered web pages using a headless browser hosted on Browserbase.
    Depends on `browserbase` package.
    Get your API key from https://browserbase.com
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
    ) -> None:
        try:
            from browserbase import Browserbase
        except ImportError:
            raise ImportError(
                "`browserbase` package not found, please run `pip install browserbase`"
            )

        self.browserbase = Browserbase(api_key=api_key)

    def lazy_load_data(
        self, urls: Sequence[str], text_content: bool = False
    ) -> Iterator[Document]:
        """Load pages from URLs."""
        pages = self.browserbase.load_urls(urls, text_content)

        for i, page in enumerate(pages):
            yield Document(
                text=page,
                metadata={
                    "url": urls[i],
                },
            )


if __name__ == "__main__":
    reader = BrowserbaseWebReader()
    logger.info(reader.load_data(urls=["https://example.com"]))
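
The pairing of pages with their source URLs in `lazy_load_data` can be tried without network access or an API key. The sketch below is a rough stand-in, not the committed code: `StubDocument` and `StubBrowserbase` are hypothetical replacements for `Document` and the SDK client, and the stub's `load_urls` only mimics the call shape used in the diff. It pairs items with `zip` rather than `urls[i]`, which gives the same result whenever pages come back in input order.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Sequence


@dataclass
class StubDocument:
    """Hypothetical stand-in for llama_index.core.schema.Document."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class StubBrowserbase:
    """Hypothetical stand-in for the SDK client; yields canned page content."""

    def load_urls(
        self, urls: Sequence[str], text_content: bool = False
    ) -> Iterator[str]:
        for url in urls:
            yield f"text of {url}" if text_content else f"<html>{url}</html>"


def lazy_load(
    client: StubBrowserbase, urls: Sequence[str], text_content: bool = False
) -> Iterator[StubDocument]:
    """Yield one document per page, tagged with the URL it was loaded from."""
    pages = client.load_urls(urls, text_content)
    for url, page in zip(urls, pages):
        yield StubDocument(text=page, metadata={"url": url})


docs = list(
    lazy_load(
        StubBrowserbase(),
        ["https://example.com", "https://example.org"],
        text_content=True,
    )
)
```

Both forms rely on the loader returning pages in the same order as the input URLs; that ordering is assumed here rather than confirmed by the diff.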
1 change: 1 addition & 0 deletions
.../readers/llama-index-readers-web/llama_index/readers/web/browserbase_web/requirements.txt
@@ -0,0 +1 @@
browserbase