
Apache OpenDAL S3

UnstructuredApacheOpendalS3FileLoader is a document loader that reads documents from an S3 bucket through the Apache OpenDAL library. It parses them with the unstructured library, so it fits into an existing unstructured document processing pipeline.

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| --- | --- | --- | --- | --- |
| UnstructuredApacheOpendalS3FileLoader | langchain_community | ✅ | ❌ | ❌ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| --- | --- | --- |
| UnstructuredApacheOpendalS3FileLoader | ✅ | ❌ |

Setup

Credentials

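No separate service credentials are required to use the UnstructuredApacheOpendalS3FileLoader. To read from the bucket, pass your AWS access key ID and secret access key to the loader through the aws_access_key_id and aws_secret_access_key arguments, as in the Instantiation section.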

Installation

%pip install --upgrade --quiet langchain-community opendal unstructured

Instantiation

Now we can instantiate our document loader object and load Documents:

from langchain_community.document_loaders.apache_opendal_s3 import (
    UnstructuredApacheOpendalS3FileLoader,
)

key = "data2.csv"
bucket = "liugddx"
region_name = "ap-northeast-1"
loader = UnstructuredApacheOpendalS3FileLoader(
    key,
    bucket,
    region_name,
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
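
Hard-coding keys as in the snippet above is fine for a demo, but they are more commonly read from the environment. The following is a minimal sketch, assuming the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set. The helper name aws_credentials_from_env is hypothetical and not part of the loader.

```python
import os
from typing import Dict, Mapping, Optional


def aws_credentials_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the two credential keyword arguments used in the example above.

    Hypothetical helper: reads the standard AWS environment variables and
    raises a clear error if either one is missing.
    """
    env = os.environ if env is None else env
    try:
        return {
            "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Intended usage with the loader from the example above (not executed here):
# loader = UnstructuredApacheOpendalS3FileLoader(
#     key, bucket, region_name, **aws_credentials_from_env()
# )
```

Passing a mapping explicitly, instead of relying on os.environ, makes the helper easy to exercise in tests without touching real credentials.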

Load

Use .load() to load the documents from the S3 bucket. The loader will return a list of documents.

docs = loader.load()
/Users/liugddx/code/langchain/.venv/lib/python3.10/site-packages/unstructured/partition/csv.py:84: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
dataframe = pd.read_csv(file, header=ctx.header, sep=ctx.delimiter)

For this CSV file, the loader returns a single document whose page_content holds the text extracted from the table:

len(docs)
1
docs[0].page_content
'\n\n\n\nD\n\n\nA10101010\nNone\n\n\n'
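
The raw page_content above is mostly blank lines. Here is a small sketch of tidying it before further processing. The helper non_empty_lines is hypothetical, and the sample string is copied from the output above so the snippet runs without S3 access.

```python
from typing import List


def non_empty_lines(text: str) -> List[str]:
    """Return the stripped, non-blank lines of a document's page_content."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Sample copied from the docs[0].page_content output shown above.
sample = "\n\n\n\nD\n\n\nA10101010\nNone\n\n\n"
cells = non_empty_lines(sample)
print(cells)  # ['D', 'A10101010', 'None']
```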

Lazy Load

The UnstructuredApacheOpendalS3FileLoader also supports lazy loading: lazy_load() hands documents over one at a time as you iterate, rather than returning them all in a single list. This helps keep memory use down when working with large documents.

for doc in loader.lazy_load():
    print(doc.page_content)




D


A10101010
None
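
Because lazy_load() is consumed as an iterator, documents can also be processed in fixed-size batches without building the full list first. The following is a minimal sketch using only the standard library. The helper batched is hypothetical, and a stand-in generator replaces loader.lazy_load() so the snippet runs without S3 access.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, pulling from `items` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(); any iterator of documents works the same way.
fake_docs = (f"doc-{i}" for i in range(5))
batches = list(batched(fake_docs, 2))
print(batches)  # [['doc-0', 'doc-1'], ['doc-2', 'doc-3'], ['doc-4']]
```

Each batch is only built when the consumer asks for it, so at most `size` documents from the source iterator are held by the helper at a time.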

API reference

For further information, please refer to the API reference.

