Manages the download and extraction of files, as well as caching.
tfds.download.DownloadManager(
    *,
    download_dir: epath.PathLike,
    extract_dir: Optional[epath.PathLike] = None,
    manual_dir: Optional[epath.PathLike] = None,
    manual_dir_instructions: Optional[str] = None,
    url_infos: Optional[Dict[str, checksums.UrlInfo]] = None,
    dataset_name: Optional[str] = None,
    force_download: bool = False,
    force_extraction: bool = False,
    force_checksums_validation: bool = False,
    register_checksums: bool = False,
    register_checksums_path: Optional[epath.PathLike] = None,
    verify_ssl: bool = True,
    max_simultaneous_downloads: Optional[int] = None
)
Downloaded files are cached under download_dir. Their file names follow the
pattern "{sanitized_url}{content_checksum}.{ext}", e.g.
 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.
While a file is being downloaded, it is placed into a directory following a similar but different pattern: "{sanitized_url}{url_checksum}.tmp.{uuid}".
When a file is downloaded, a "{fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}
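As a rough illustration of the sidecar layout described above, the sketch below writes an INFO file next to a placeholder download and reads it back using only the standard library. The file name `example_download.tar.gz` and the helper `read_info` are hypothetical; only the two JSON keys are taken from the description above.

```python
import json
import tempfile
from pathlib import Path


def read_info(downloaded_file: Path) -> dict:
    """Load the '{fname}.INFO.json' sidecar stored next to a downloaded file."""
    info_path = downloaded_file.with_name(downloaded_file.name + ".INFO.json")
    return json.loads(info_path.read_text())


# Build a throwaway download directory so the sketch runs anywhere.
download_dir = Path(tempfile.mkdtemp())
fake_download = download_dir / "example_download.tar.gz"  # hypothetical name
fake_download.write_bytes(b"")

# Write a sidecar with the two keys listed in the description above.
sidecar = download_dir / (fake_download.name + ".INFO.json")
sidecar.write_text(json.dumps({
    "dataset_names": ["name1", "name2"],
    "urls": ["http://url.of/downloaded_file"],
}))

info = read_info(fake_download)
print(info["dataset_names"], info["urls"])
```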
Extracted files/dirs are stored under extract_dir. The file name or
directory name is the same as the original name, prefixed with the extraction
method. E.g.
 "{extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".
The member functions accept either a plain value or values wrapped in a list or dict. Passing a list or dict parallelizes the downloads.
Example of usage:
# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('https://abc.org/train.tar.gz')
test_dir = dl_manager.download_and_extract('https://abc.org/test.tar.gz')
# Parallel download: list -> list
image_files = dl_manager.download(
    ['https://a.org/1.jpg', 'https://a.org/2.jpg', ...])
# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
   'train': 'https://abc.org/train.zip',
   'test': 'https://abc.org/test.zip',
})
data_dirs['train']
data_dirs['test']
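The str -> str, list -> list and dict -> dict behaviour shown above can be pictured with a small standard-library sketch. This is not the DownloadManager implementation: `fake_download` and `map_structure` are hypothetical helpers, and the thread pool is only one assumed way of running the per-URL work in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> str:
    """Stand-in for a real download: returns a made-up local path."""
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


def map_structure(fn, url_or_urls):
    """Apply fn to a str, list or dict and return the same structure."""
    if isinstance(url_or_urls, str):
        return fn(url_or_urls)  # single value: nothing to parallelize
    with ThreadPoolExecutor() as pool:
        if isinstance(url_or_urls, dict):
            keys = list(url_or_urls)
            results = pool.map(fn, [url_or_urls[k] for k in keys])
            return dict(zip(keys, results))  # keys keep their input order
        return list(pool.map(fn, url_or_urls))  # results keep input order


paths = map_structure(fake_download, {
    "train": "https://abc.org/train.zip",
    "test": "https://abc.org/test.zip",
})
print(paths["train"])
```

The output always mirrors the shape of the input, which is why the dict example above can be indexed with the same keys it was called with.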
For more customization of the download/extraction (e.g. passwords, output_name,
...), you can pass a tfds.download.Resource as the argument.
| Raises | |
|---|---|
| FileNotFoundError | Raised if the register_checksums_path does not exist. | 
Methods
download
download(
    url_or_urls
)
Download given url(s).
| Args | |
|---|---|
| url_or_urls | url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource. |
| Returns | |
|---|---|
| downloaded_path(s) | The downloaded path(s), in the same structure (str, list or dict) as url_or_urls. |
download_and_extract
download_and_extract(
    url_or_urls
)
Download and extract given url_or_urls.
Is roughly equivalent to:
extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))
| Args | |
|---|---|
| url_or_urls | url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource. If not explicitly specified in Resource, the extraction method is deduced automatically from the downloaded file name. |
| Returns | |
|---|---|
| extracted_path(s) | The extracted path(s), in the same structure (str, list or dict) as url_or_urls. |
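To make "deduced from the downloaded file name" concrete, here is a minimal sketch of suffix-based detection. It is an assumption about the general idea, not the library's actual logic: the helper `guess_extract_method` and the suffix table are hypothetical, and only the TAR_GZ label appears elsewhere in this page.

```python
from typing import Optional

# Illustrative suffix table. Only the TAR_GZ label appears in the text above;
# the other labels are assumptions made for this sketch.
# Longer suffixes come first so ".tar.gz" wins over ".gz".
_SUFFIX_TO_METHOD = [
    (".tar.gz", "TAR_GZ"),
    (".tgz", "TAR_GZ"),
    (".tar", "TAR"),
    (".zip", "ZIP"),
    (".gz", "GZIP"),
]


def guess_extract_method(file_name: str) -> Optional[str]:
    """Return the label of the first matching suffix, or None if none match."""
    lowered = file_name.lower()
    for suffix, method in _SUFFIX_TO_METHOD:
        if lowered.endswith(suffix):
            return method
    return None


print(guess_extract_method("cifar-100-python.tar.gz"))
```

When the file name carries no usable suffix, a sketch like this has nothing to go on, which is the case where specifying the method explicitly in a Resource is useful.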
download_checksums
download_checksums(
    checksums_url
)
Downloads checksum file from the given URL and adds it to registry.
download_kaggle_data
download_kaggle_data(
    competition_or_dataset: str
) -> epath.Path
Download data for a given Kaggle Dataset or competition.
| Args | |
|---|---|
| competition_or_dataset | Dataset name (zillow/zecon) or competition name (titanic). |
| Returns | |
|---|---|
| The path to the downloaded files. | 
extract
extract(
    path_or_paths
)
Extract given path(s).
| Args | |
|---|---|
| path_or_paths | path or list/dict of paths of files to extract. Each path can be a str or tfds.download.Resource. If not explicitly specified in Resource, the extraction method is deduced from the downloaded file name. |
| Returns | |
|---|---|
| extracted_path(s) | The extracted path(s), in the same structure (str, list or dict) as path_or_paths. |
iter_archive
iter_archive(
    resource: ExtractPath
) -> Iterator[Tuple[str, typing.BinaryIO]]
Returns iterator over files within archive.
Important Note: caller should read files as they are yielded. Reading out of order is slow.
| Args | |
|---|---|
| resource | path to archive or tfds.download.Resource. | 
| Returns | |
|---|---|
| Generator yielding tuple (path_within_archive, file_obj). |
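The note about reading files as they are yielded can be illustrated with the standard-library tarfile module in streaming mode. This is an analogy rather than the iter_archive implementation: `iter_tar` is a hypothetical helper that yields the same kind of (path_within_archive, file_obj) tuples from an in-memory archive.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar(archive: BinaryIO) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (path_within_archive, file_obj) for each regular file, in order."""
    # Mode "r|*" reads the archive as a forward-only stream.
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


# Build a small in-memory archive so the sketch is runnable as-is.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
        entry = tarfile.TarInfo(name)
        entry.size = len(payload)
        tar.addfile(entry, io.BytesIO(payload))
buffer.seek(0)

# Read each file as soon as it is yielded, before advancing the iterator.
contents = {path: file_obj.read() for path, file_obj in iter_tar(buffer)}
print(contents)
```

Because the stream only moves forward, each file object is consumed before the next one is requested, which matches the access pattern the note recommends.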