Whole-of-Australian Government Web Crawl

ARCHIVED

Created 06/08/2018

Updated 13/11/2024

Includes publicly accessible, human-readable material from Australian government websites, gathered over a 10-day period. The crawl used the Australian Government Organisations Register (AGOR) as its seed list and domain categorisation source, and obeyed robots.txt and sitemap.xml directives.
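As an illustration of what obeying robots.txt means in practice, the sketch below checks URLs against a set of rules using Python's standard library. It is not the crawler used to build this dataset; the robots.txt content, the `example-crawler` user agent and the example.gov.au URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str,
            user_agent: str = "example-crawler") -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "https://www.example.gov.au/about"))
print(allowed(parser, "https://www.example.gov.au/private/report"))
```

A crawler following this approach would fetch each site's robots.txt once, build a parser for it, and skip any URL for which `allowed` returns False.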

Several non-*.gov.au domains are included in the AGOR. These have been crawled up to a limit of 10,000 URLs.
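The sketch below shows one way such a cap could be enforced. It assumes the limit is applied per host and that *.gov.au hosts are not capped; the description does not confirm either point. The `CrawlBudget` class and `is_gov_au` helper are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlsplit

NON_GOV_URL_LIMIT = 10_000  # limit stated in the dataset description


def is_gov_au(host: str) -> bool:
    """True for gov.au itself and any host ending in .gov.au."""
    host = host.lower().rstrip(".")
    return host == "gov.au" or host.endswith(".gov.au")


class CrawlBudget:
    """Counts admitted URLs per non-gov.au host (per-host cap is an assumption)."""

    def __init__(self, limit: int = NON_GOV_URL_LIMIT) -> None:
        self.limit = limit
        self.seen: Counter = Counter()

    def admit(self, url: str) -> bool:
        host = (urlsplit(url).hostname or "").lower()
        if not host:
            return False  # no usable host, nothing to crawl
        if is_gov_au(host):
            return True  # assumed: gov.au hosts are not capped
        if self.seen[host] >= self.limit:
            return False  # cap reached for this host
        self.seen[host] += 1
        return True
```

For example, `CrawlBudget(limit=2)` admits the first two URLs from a non-gov.au host and rejects the third, while URLs on *.gov.au hosts are always admitted.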

Several binary file formats are included and have been converted to HTML: doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx
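When working with the dataset, it can be useful to tell which URLs point at one of these converted formats. The sketch below matches on the file extension in the URL path. This is an assumption made for illustration: the crawl may have identified formats differently, for example by content type. The `is_converted_format` function name is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# The 16 extensions listed in the dataset description.
CONVERTED_EXTENSIONS = frozenset(
    "doc docm docx dot epub keys numbers pages pdf "
    "ppt pptm pptx rtf xls xlsm xlsx".split()
)


def is_converted_format(url: str) -> bool:
    """True if the URL path ends in one of the listed extensions."""
    path = urlsplit(url).path  # drops the query string and fragment
    suffix = PurePosixPath(path).suffix  # e.g. ".PDF"
    return suffix.lower().lstrip(".") in CONVERTED_EXTENSIONS


print(is_converted_format("https://www.example.gov.au/docs/Report.PDF?v=1"))
print(is_converted_format("https://www.example.gov.au/index.html"))
```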

URLs returning responses larger than 10MB are not included in the dataset.

Raw gathered data (including metadata) is published in the Web Archive (WARC) format, both as a single multi-gigabyte WARC file and as a split series.
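A WARC file is a sequence of records, each consisting of a version line, named headers, a blank line, and a body whose size is given by the Content-Length header. The sketch below is a minimal standard-library reader for uncompressed WARC data, intended only to show that structure. It does not handle folded header lines, and the sample record is made up. If the published files are gzip-compressed, the stream would need to be opened with `gzip.open` first, and for real work a dedicated WARC library such as warcio is likely a better choice.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers["content-length"]))
        yield headers, body


# Made-up sample record, for illustration only.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://www.example.gov.au/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(SAMPLE)))
for headers, body in records:
    print(headers["warc-type"], headers["warc-target-uri"], len(body))
```

Reading records one at a time keeps memory use bounded, which matters for a multi-gigabyte file.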

Metadata extracted from pages after filtering is published in JSON format, with fields defined in a data dictionary.

Licence

Web content contained within these WARC files was originally authored by the agency hosting the referenced material. Authoring agencies are responsible for the choice of licence attached to the original material.

A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.

Additional Info

Title: Whole-of-Australian Government Web Crawl
Language: English
Licence: Other
Landing Page: https://devweb.dga.links.com.au/data/dataset/99f43557-1d3d-40e7-bc0c-665a4275d625
Contact Point: Digital Transformation Agency, data@digital.gov.au
Reference Period: 03/08/2018 - 20/11/2018
Geospatial Coverage: Australia
Data Portal: data.gov.au

Data Source

This dataset was originally published on data.gov.au as "Whole-of-Australian Government Web Crawl". Please visit the source to access the original metadata of the dataset:
https://devweb.dga.links.com.au/data/dataset/whole-of-australian-government-web-crawl
