Apache OpenDAL S3
The UnstructuredApacheOpendalS3FileLoader class is a document loader that reads a file from an S3 bucket using the Apache OpenDAL library. It parses the file with the unstructured library and is compatible with the unstructured document processing pipeline.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| UnstructuredApacheOpendalS3FileLoader | langchain_community | โ | โ | โ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| UnstructuredApacheOpendalS3FileLoader | โ | โ |
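Setup
Credentials
No separate API key is required to use the UnstructuredApacheOpendalS3FileLoader itself. AWS credentials are passed to the loader through the aws_access_key_id and aws_secret_access_key arguments, as in the instantiation example.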
Installation
%pip install --upgrade --quiet opendal unstructured
Instantiation
Now we can instantiate our document loader object and load Documents:
from langchain_community.document_loaders.apache_opendal_s3 import (
    UnstructuredApacheOpendalS3FileLoader,
)

key = "data2.csv"
bucket = "liugddx"
region_name = "ap-northeast-1"
loader = UnstructuredApacheOpendalS3FileLoader(
    key,
    bucket,
    region_name,
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
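Hardcoded keys in a notebook are easy to leak. The following is a minimal sketch of one alternative, assuming the credentials are stored in the conventional `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper name `s3_credentials_from_env` is hypothetical and not part of the loader; it only builds a dictionary and never contacts S3.

```python
import os


def s3_credentials_from_env(env=None):
    """Build the credential keyword arguments from environment variables.

    Hypothetical helper for illustration: the variable names are an
    assumption, and nothing here talks to S3.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    # Fail early with a clear message instead of passing empty strings along.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The returned dictionary could then be unpacked into the constructor call in place of the literal keyword arguments, for example `UnstructuredApacheOpendalS3FileLoader(key, bucket, region_name, **s3_credentials_from_env())`.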
Load
Use .load() to fetch the file from the S3 bucket and parse it. The loader returns a list of Document objects.
docs = loader.load()
/Users/liugddx/code/langchain/.venv/lib/python3.10/site-packages/unstructured/partition/csv.py:84: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
dataframe = pd.read_csv(file, header=ctx.header, sep=ctx.delimiter)
Returns each table row as dict.
len(docs)
1
docs[0].page_content
'\n\n\n\nD\n\n\nA10101010\nNone\n\n\n'
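The page_content above is padded with blank lines. As a rough sketch of one way to tidy it before further processing, the hypothetical helper `non_empty_lines` (not part of the loader) keeps only the lines that carry text. The sample string is the value shown above.

```python
def non_empty_lines(page_content: str) -> list[str]:
    """Split text into lines, dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in page_content.splitlines() if line.strip()]


# The page_content value from the example above.
raw = "\n\n\n\nD\n\n\nA10101010\nNone\n\n\n"
print(non_empty_lines(raw))  # ['D', 'A10101010', 'None']
```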
Lazy Load
The UnstructuredApacheOpendalS3FileLoader supports lazy loading through lazy_load(), which returns an iterator that yields documents one at a time instead of building the full list up front. This can be useful when working with large documents.
for doc in loader.lazy_load():
    print(doc.page_content)
D
A10101010
None
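When the iterator from lazy_load() feeds a downstream step that prefers small groups of documents, it can be consumed in batches. The sketch below assumes nothing about the loader beyond it being iterable; `in_batches` and `fake_lazy_load` are hypothetical names, and the stand-in generator replaces `loader.lazy_load()` so the example runs without S3 access.

```python
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of up to `size` items without materializing the whole iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls at most `size` items from the shared iterator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load():
    """Stand-in for loader.lazy_load(); yields plain strings instead of Documents."""
    for text in ("D", "A10101010", "None"):
        yield text


for batch in in_batches(fake_lazy_load(), 2):
    print(batch)
```

Only one batch is held in memory at a time, which keeps the benefit of lazy loading while still allowing grouped processing.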
API reference
For further information, please refer to the API reference.
Related
- Document loader conceptual guide
- Document loader how-to guides