Free and Open-Source Automated Open Access Preprint Harvesting

Universities are attempting to ensure that all of their research is publicly accessible because of funding mandates. Many universities have established campus open access (OA) repositories but are struggling with how to upload millions of manuscripts under numerous license agreements while also linking metadata to make them discoverable. To do this manually requires around 15 minutes per manuscript from an experienced librarian. The time and cost to do this campus-wide is prohibitive. To radically reduce the time and costs of this process and to harvest all past work, this article reports on the development and testing of a free and open-source (FOSS) JavaScript-based application, aperta-accessum, which does the following: 1) harvests names and emails from a department's faculty webpage; 2) identifies scholars' Open Researcher and Contributor IDentifiers (ORCID iDs); 3) obtains digital object identifiers (DOIs) of publications for each scholar; 4) checks for existing copies in an institution's OA repository; 5) identifies the legal opportunities to provide OA versions of all of the articles not already in the OA repository; 6) sends authors emails requesting a simple upload of author manuscripts; and 7) adds link-harvested metadata from DOIs with uploaded preprints into a bepress repository; the code can be modified for additional repositories. The results of this study show that, in the administrative time needed to make a single document OA manually, aperta-accessum can process approximately five entire departments' worth of peer-reviewed articles. When the best practices discussed are followed, this open-source OA harvester enables institutional libraries' stewardship of OA knowledge on a mass scale for radically reduced costs.


INTRODUCTION
In order for science to progress optimally, it is understood that scientific knowledge must be commonly owned (Sismondo, 2010, p. 24); however, scientific progress is impeded by restricting access to copyrighted scientific literature without payment (Gibbons, 1994; Heise & Pearce, 2020). This collateral damage of the intellectual property concept (Boldrin & Levine, 2008) is an unintended consequence of copyright laws and paywalls, which restrict access to the peer-reviewed scientific literature (Lewis, 2012) to the point that even wealthy Harvard University is challenged to pay for it (Sample, 2012). This has divided scientists into the "haves and have nots" (Chagas, 2018). Fortunately, the open access (OA) movement is flourishing, making up at least 28% of the literature (Piwowar et al., 2017), and is moving toward the point that the peer-reviewed literature could be universally accessible (Johnston, 2008; Joseph, 2013; Liesegang, 2013). Lewis even argues that OA is inevitable based on the empirical growth rates (2012). The benefits of OA are well established in the literature and include the following: 1) the pragmatic advantage that, by being freely and easily available on the Internet, an author's work reaches the widest possible audience (Lewis, 2012); 2) increased citation rates for OA publishing (Antelman, 2004; Harnad & Brody, 2004; Hajjem et al., 2005; Eysenbach, 2006; Swan, 2010; Niyazov et al., 2016); 3) access to relevant literature for making significant advancements in knowledge (Boote & Beile, 2005; Webster & Watson, 2002); and 4) increased efficiency and effectiveness of science (Partha & David, 1994). Poynder summarizes "…it is no longer rational, or even necessary, for subscription paywalls to be built between researchers and research" (2011).
Not surprisingly, a growing list of funders demand OA for studies that they finance. There is a particularly strong case for OA for public funding of science (Suber, 2003; Suber, 2012; Heise & Pearce, 2020) based on the simple idea that, if the public funds research, the public should at the very least be able to read it. Globally, 87 major funders and 57 funder-research organizations already demand OA of work that they fund (ROARM, 2021). Finally, over 850 universities and research organizations have also mandated that researchers share their work OA (ROARM, 2021). As universities navigate how to transition to OA, they are developing OA policies and then moving to support their researchers in making OA possible. A popular method in some areas has been to finance gold OA fees for a single publisher (e.g., Sweden's deal with Springer/Nature publishing group that allows all Swedish academics to publish their work OA for free in the entire publishing line, which increases OA and simplifies workflows for researchers but risks conserving the high costs associated with the status quo) (Olsson et al., 2020). For universities in nations that do not offer such a program, individual universities have also experimented with funding article processing charges (APCs) for OA for their faculty members (Gyore et al., 2015). In contrast, the most-common method to provide OA for universities has been to develop university-specific institutional OA repositories (Pinfield et al., 2014; Liauw & Genoni, 2017) and encourage (or require) that the faculty members deposit their work there. This method supports green OA, which refers to the growing self-archiving (Swan & Brown, 2005) of a version of the article (normally not the final published version) on an institutional repository (IR). For example, Western University is considering a strong OA policy (Western Libraries, 2021) that would grant the university non-exclusive permission to archive and disseminate articles via Western's IR.
Under the policy, university community members agree to publish in OA publications and/or deposit scholarly work in Western's OA IR, Scholarship@Western, or in a disciplinary repository such as arXiv as early as possible, ideally sometime between the date of acceptance and the date of publication. This will obviously be a boon for making Western University's scholarship more accessible and help accelerate science worldwide, as those without library subscriptions must currently pay about $35 per article for articles behind paywalls. Providing free access to all of a university's scholarship this way (although no additional funds go to the publishers) does, however, involve an enormous amount of work with associated internal costs for the millions of articles. To radically reduce the time and costs of this process and harvest all past work, this article reports on the development and testing of a free and open-source (FOSS) JavaScript-based application, aperta-accessum, which does the following: 1) harvests names and emails from an academic department's faculty webpage; 2) identifies the scholars' Open Researcher and Contributor IDentifiers (ORCID iDs); 3) obtains the full list of the digital object identifiers (DOIs) of scholarly publications for each scholar; 4) determines whether the articles already exist in an institution's OA database; 5) identifies the legal opportunities to provide OA versions of all of the articles not already in the OA database; 6) sends authors an email requesting a simple upload of the author's manuscripts; and 7) adds link-harvested metadata from the individual DOI and an uploaded preprint into a bepress repository; the code can be modified for additional repositories.
The time savings that aperta-accessum provides librarians in making articles OA in bulk are quantified, best practices are reviewed, and the ability of this OA harvester is discussed in the context of the future of institutional libraries' stewardship of OA knowledge.
There is a widespread concern among professors (Beaubien & Eckard, 2014) that requiring OA would be onerous because publication in high-impact journals is an important component of demonstrating expertise for grants, tenure, and promotion, and many OA journals, because they are, in general, newer, carry lower impact factor scores (Pearce, 2022). This challenge can be overcome in two ways. First, nearly all publishers and journals allow preprint/accepted manuscript posting in traditional subscription journals. Second, authors can pay APCs to OA journals or OA fees to hybrid journals (i.e., subscription journals that allow authors to pay for OA for a specific article). Over 17,000 journals offer a means of OA, and over 12,000 have no APCs (Directory of Open Access Journals, 2021). To recruit faculty participation, which is critical for the success of even this automated process, care must be taken (Otto, 2016) with the marketing needed to make faculty aware of OA in general (Colla, 2020; Kakai, 2021) and of this initiative in particular, because of the risk of filling faculty inboxes with preprint requests.

Open-source software design and operation
JavaScript was used to write the scripts for aperta-accessum, which is released under the GNU General Public License (GPL) version 3 (GNU, 2007) and is available on GitHub (https://github.com/jackpeplinski/aperta-accessum). A registry of this article's version is available on the Open Science Framework (https://doi.org/10.17605/OSF.IO/7MECZ). These scripts were written to be run using Node.js. The upload site for aperta-accessum was created primarily using a variety of JavaScript libraries and frameworks, including React, Material-UI, Emotion, and React Dropzone. A potential workflow using aperta-accessum is shown in Figure 1.
The administrator is the person responsible for the maintenance of the IR. In Figure 1, a bepress repository was the IR used.

Potential workflow description
Stage 1 of aperta-accessum: administrator article identification. This stage provides the administrator with a comma-separated values (CSV) file of article titles; DOIs; and authors' first names, last names, and email addresses. This CSV can be used to send emails to prompt researchers to upload their articles to the IR. The administrator can begin the workflow by executing the command "node sendEmail.js" from the command line. This command will run the sendEmail.js script.
The "getPeople(scrapeURL)" function is executed first. This function scrapes emails and first and last names from a specified URL (e.g., a faculty directory page).
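As a rough illustration of the scraping step, the sketch below pulls email addresses out of a directory page's HTML. It is not the tool's actual getPeople implementation: the function name extractEmails, the sample markup, and the assumption that addresses appear in mailto links are all hypothetical, and name extraction (which depends on each page's layout) is omitted.

```javascript
// Hypothetical helper: collect unique email addresses from mailto links.
// Assumes the directory page exposes addresses as href="mailto:..." values.
function extractEmails(html) {
  const pattern = /mailto:([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})/g;
  const emails = new Set();
  for (const match of html.matchAll(pattern)) {
    emails.add(match[1].toLowerCase()); // normalize case so duplicates collapse
  }
  return [...emails];
}

// Made-up markup standing in for a faculty directory page.
const sampleDirectoryHtml = `
  <li><a href="mailto:Jane.Doe@example.edu">Jane Doe</a></li>
  <li><a href="mailto:jane.doe@example.edu">Jane Doe (lab)</a></li>
  <li><a href="mailto:j.smith@example.edu">John Smith</a></li>`;

console.log(extractEmails(sampleDirectoryHtml));
```

Pages that obfuscate addresses or render them with client-side scripts would need a different approach.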
The "getORCIDID(fName, lName, institution)" function uses these scraped first and last names to get the ORCID iDs of all authors with the same first and last name at the specified institution from the ORCID database. These parameters were recommended by Western librarians as sufficient to determine whether the works of the ORCID iD should be included in the university's IR.
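The name-and-institution lookup can be pictured as a single search query. The sketch below only builds such a query URL and is an assumption about how the lookup might be phrased, not the tool's own code: the helper name buildOrcidSearchUrl is hypothetical, and the endpoint and field names (given-names, family-name, affiliation-org-name) are those of the ORCID public search API as best understood here and should be checked against the ORCID documentation before use.

```javascript
// Hypothetical helper: build an ORCID public-API search URL for one person.
// Field names are assumed from the ORCID search documentation.
function buildOrcidSearchUrl(firstName, lastName, institution) {
  const query = [
    `given-names:"${firstName}"`,
    `family-name:"${lastName}"`,
    `affiliation-org-name:"${institution}"`,
  ].join(" AND ");
  // encodeURIComponent keeps quotes, colons, and spaces safe inside the URL.
  return `https://pub.orcid.org/v3.0/search/?q=${encodeURIComponent(query)}`;
}

console.log(buildOrcidSearchUrl("Ada", "Lovelace", "Western University"));
```

Because several people can share a name at one institution, a search of this kind can return more than one ORCID iD, which matches the behavior described above.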

JLSC
For each ORCID iD, the "getDOIs(ORCIDID)" function is executed. This function gets the DOIs for works listed for the ORCID iD from the ORCID database.
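To make the DOI-gathering step concrete, the following sketch filters DOIs out of a works listing. The function name extractDois and the abbreviated sample object are hypothetical; the nesting (group, external-ids, external-id, external-id-type, external-id-value) is assumed to follow the ORCID public API's works summary and may differ between API versions.

```javascript
// Hypothetical helper: keep only DOI-type identifiers from a works listing.
function extractDois(worksResponse) {
  const dois = new Set();
  for (const group of worksResponse.group || []) {
    const ids = (group["external-ids"] && group["external-ids"]["external-id"]) || [];
    for (const id of ids) {
      if (String(id["external-id-type"]).toLowerCase() === "doi") {
        // DOIs are case-insensitive, so lowercase them before de-duplicating.
        dois.add(String(id["external-id-value"]).trim().toLowerCase());
      }
    }
  }
  return [...dois];
}

// Abbreviated, made-up sample in the assumed response shape.
const sampleWorks = {
  group: [
    { "external-ids": { "external-id": [
      { "external-id-type": "doi", "external-id-value": "10.1000/ABC" },
      { "external-id-type": "eid", "external-id-value": "2-s2.0-123" },
    ] } },
    { "external-ids": { "external-id": [
      { "external-id-type": "doi", "external-id-value": "10.1000/abc" },
    ] } },
    { "external-ids": { "external-id": [
      { "external-id-type": "isbn", "external-id-value": "9780000000000" },
    ] } },
    { "external-ids": { "external-id": [
      { "external-id-type": "doi", "external-id-value": "10.1000/xyz" },
    ] } },
  ],
};

console.log(extractDois(sampleWorks));
```

Works that carry only non-DOI identifiers drop out at this point.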
For each DOI, the "getDuplicateDOIStatus(DOI)," "getDuplicateTitleStatus(DOI)," "getPermissionsStatus(DOI)," "getOpenAccessStatus(DOI)," and "getTitle(DOI)" functions are executed. These functions use application programming interfaces (APIs): the bepress API is used for the first two function calls, and OA.Works, Unpaywall, and Crossref are used for the remaining three, respectively, as shown in Figure 1. Together, these calls determine whether the article is already present in the IR, whether the article is able to be made OA, and whether the article is already OA, and they retrieve the title of the article.
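A minimal sketch of how two of these API responses might be read is given below. The helper names parseUnpaywall and parseCrossrefTitle are hypothetical, as are the sample objects; the field names (is_oa and best_oa_location for Unpaywall, message.title for Crossref) are assumed from those services' public response formats, and the network requests themselves are left out.

```javascript
// Hypothetical helper: reduce an Unpaywall record to an OA flag and a URL.
function parseUnpaywall(record) {
  const location = record.best_oa_location || null;
  return {
    isOpenAccess: record.is_oa === true,
    // Prefer a direct PDF link when one is listed, else the landing page.
    url: location ? location.url_for_pdf || location.url || null : null,
  };
}

// Hypothetical helper: Crossref returns titles as an array under message.title.
function parseCrossrefTitle(response) {
  const titles = (response.message && response.message.title) || [];
  return titles.length > 0 ? titles[0] : null;
}

// Made-up records in the assumed shapes.
const sampleUnpaywall = {
  is_oa: true,
  best_oa_location: { url: "https://example.org/article", url_for_pdf: "https://example.org/article.pdf" },
};
const sampleCrossref = { message: { title: ["An Example Article"] } };

console.log(parseUnpaywall(sampleUnpaywall), parseCrossrefTitle(sampleCrossref));
```

Returning null for a missing title mirrors the workflow's requirement that a title be available before an entry is created.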
If the article is not already present in the IR but is eligible to be OA, and if the title is available, an entry in the CSV is created. If these previous conditions are met and the article is already OA, the DOI and URL where the file is available are added to a JSON (JavaScript Object Notation) file. JSON is a lightweight data-interchange format, which is both easy for humans to read and write as well as easy for machines to parse and generate.
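The routing rule above can be sketched as a small decision function. This is one reading of the rule, not the tool's code: it assumes that an eligible article that is already OA goes to the JSON file rather than the CSV, and the names classifyArticle and toCsvRow, the status labels, and the column order are hypothetical.

```javascript
// Hypothetical routing: "skip", "email" (CSV entry), or "already-oa" (JSON entry).
// Assumption: already-OA articles are recorded in JSON instead of being emailed.
function classifyArticle({ inRepository, canBeOpenAccess, title, openAccessUrl }) {
  if (inRepository || !canBeOpenAccess || !title) return "skip";
  return openAccessUrl ? "already-oa" : "email";
}

// Quote every field and double embedded quotes so commas in titles are safe.
function toCsvRow(fields) {
  return fields.map((f) => `"${String(f).replace(/"/g, '""')}"`).join(",");
}

const candidate = {
  inRepository: false,
  canBeOpenAccess: true,
  title: 'Solar "Harvesting", Revisited',
  openAccessUrl: null,
};

if (classifyArticle(candidate) === "email") {
  // Assumed column order: title, DOI, first name, last name, email.
  console.log(toCsvRow([candidate.title, "10.1000/xyz", "Jane", "Doe", "jane.doe@example.edu"]));
}
```

Quoting each field keeps the CSV usable in a mail merge even when titles contain commas.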
The administrator can then use the CSV to email professors to prompt them to upload their articles to the IR. This can be done using Outlook's mail merge feature, for example.
Stage 2 of aperta-accessum: professor submission of articles. Once the professor has received an email, they will click the custom URL and be directed to a webpage. The webpage is included in the aperta-accessum repository and needs to be deployed only once; further instructions are included in the GitHub README. The webpage will display the DOI and title of the article; the professor will be prompted to upload the article and click a "submit" button.
When the submit button is clicked, the file is automatically uploaded to a specified Dropbox folder. The Dropbox folder needs to be configured only once; further instructions are included in the GitHub README. The name of the file is the article's DOI, modified to comply with Dropbox's file naming rules. For bepress, uploading the file to a cloud service such as Dropbox is required because bepress requires a public URL ending in ".pdf."
Stage 3 of aperta-accessum: administrator database inclusion. The administrator should execute the command "node createXML.js" using the command line when either their Dropbox storage is reaching capacity or few new uploads are being made. This command will run the createXML.js script and create an XML file, upload.xml, that can be uploaded to bepress, which will enter the articles in Dropbox into the bepress database.
The "getFileNames()" function is executed first. This function gets all of the names of files in the specified Dropbox folder. The names of these files are the article's DOIs, which are modified to comply with Dropbox's file naming rules.
The "changeNameToDOI(name)" function changes the file's name from the modified Dropbox form back to the proper DOI.
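The exact substitution the tool applies to DOIs is not described here, so the sketch below shows one possible reversible scheme rather than the actual one. It assumes percent-encoding is acceptable in file names; the helper names doiToFileName and fileNameToDoi are hypothetical.

```javascript
// Hypothetical scheme: percent-encode characters that are unsafe in file names.
// A DOI always contains "/", which cannot appear inside a file name.
function doiToFileName(doi) {
  // encodeURIComponent leaves "*" untouched, so it is encoded by hand.
  return encodeURIComponent(doi).replace(/\*/g, "%2A") + ".pdf";
}

// Reverse the scheme: drop the extension, then decode.
function fileNameToDoi(fileName) {
  return decodeURIComponent(fileName.replace(/\.pdf$/i, ""));
}

const exampleDoi = "10.1000/xyz123";
const exampleFileName = doiToFileName(exampleDoi);
console.log(exampleFileName, "->", fileNameToDoi(exampleFileName));
```

Whatever scheme is used, the property that matters is that the round trip returns the original DOI unchanged, since the DOI is later used to fetch metadata.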
The "createXML(DOI, name)" function uses this proper DOI and name to create an XML file. This function includes two other function calls, "getFullTextURL(name)" and "getMetadata(DOI)." The information from both of these functions is used to create the XML.
The "getFullTextURL(name)" function gets a URL from Dropbox as the content of the "fulltext-url" XML tag.
The "getMetadata(DOI)" function gets metadata from Crossref for the DOI. This metadata is used as the content for the "title," "publication-date," and "author" XML tags.
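The mapping from Crossref metadata to the XML tags named above might look like the following sketch. It is illustrative only: the helper names, the nesting of fname and lname inside author, the authors wrapper, and the date format are assumptions, and the real bepress batch-upload schema should be consulted before generating upload.xml. The Crossref field names (title, author, issued.date-parts) are assumed from that service's public response format.

```javascript
// Escape the five characters that are special in XML text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: pick the fields the workflow needs from a Crossref message.
function crossrefToFields(message) {
  // date-parts may hold only a year; a missing month or day defaults to 1.
  const [year, month = 1, day = 1] = message.issued["date-parts"][0];
  const pad = (n) => String(n).padStart(2, "0");
  return {
    title: message.title[0],
    publicationDate: `${year}-${pad(month)}-${pad(day)}`,
    authors: (message.author || []).map((a) => ({ fname: a.given || "", lname: a.family || "" })),
  };
}

// Hypothetical helper: render one document element (element nesting is assumed).
function buildDocumentXml(fields, fullTextUrl) {
  const authors = fields.authors
    .map((a) => `      <author><fname>${escapeXml(a.fname)}</fname><lname>${escapeXml(a.lname)}</lname></author>`)
    .join("\n");
  return [
    "  <document>",
    `    <title>${escapeXml(fields.title)}</title>`,
    `    <publication-date>${fields.publicationDate}</publication-date>`,
    "    <authors>",
    authors,
    "    </authors>",
    `    <fulltext-url>${escapeXml(fullTextUrl)}</fulltext-url>`,
    "  </document>",
  ].join("\n");
}

// Made-up Crossref message and public file URL.
const sampleMessage = {
  title: ["Solar & Wind"],
  issued: { "date-parts": [[2021, 3]] },
  author: [{ given: "Jane", family: "Doe" }],
};
console.log(buildDocumentXml(crossrefToFields(sampleMessage), "https://example.org/10.1000%2Fxyz123.pdf"));
```

Escaping matters in practice because article titles often contain ampersands or angle brackets that would otherwise produce an invalid file.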
When the createXML.js file execution has been completed, the administrator can upload the upload.xml file to bepress, which concludes the workflow.

Experimental trials
To compare the time savings of aperta-accessum with processing publications manually, it was run for a department, Western University's Electrical and Computer Engineering Department (https://www.eng.uwo.ca/electrical/people/faculty/index.html), and for a high-productivity professor. The times for the administrator to complete the stages they were responsible for, Stages 1 and 3, were logged, and each trial was repeated three times by one librarian at Western.

Institutional context
Western University is a large research-intensive university located in London, Ontario, Canada that is part of the U15 Group of Canadian Research Universities (U15, 2022). Western prides itself on research excellence and the success of its 40,000-member student body. Western identifies "Greater Impact" as one of the key pillars on which it anchors its latest strategic plan. One important way of achieving this goal is to accelerate research, scholarship, and creativity "to serve not only individual disciplines but also the public good by advancing knowledge and sharing it…" (Western University, 2021). As previously noted, OA is one way of serving the public good, as it facilitates equitable access to academic scholarship by eliminating access and financial barriers.
In 2019, Western established the Provost's Task Force on Open Access, which undertook a campus-wide consultation process toward an institutional policy on OA. As of this writing, the draft policy is still under review, and workflows are being established to support faculty in depositing their work under the new policy.

Limitations and future work
aperta-accessum is only as accurate as the APIs that it relies on. If there is incomplete or incorrect data in the databases that aperta-accessum uses, it will not catch these issues. For example, aperta-accessum searches the ORCID database using an author's first and last name and institution to get a list of publications; if authors do not have their publications linked to their ORCID profile or if these publications do not have DOIs, aperta-accessum cannot send emails for these missing articles to be uploaded.
Additionally, bepress does not support an API to upload or revise articles. If bepress adds this functionality in the future, it would streamline the aperta-accessum process significantly by removing the need for Dropbox, the creation of an XML file, and the editing and upload of a revised CSV file.
The primary current limitation of aperta-accessum is that it only works for bepress, which has been criticized as it is now owned by a for-profit academic publisher (McKenzie, 2017). The code of aperta-accessum could be expanded to enable its use in other commonly used open-source OA stacks such as DSpace, Islandora, and Samvera. Customizing aperta-accessum for these other applications would require modifying the code for Stage 2 and Stage 3. The difficulty of these modifications depends on the specific application, but, in general, open-source applications have more API functionality, which could remove the need to upload the XML and delete the Dropbox files and would therefore make modifying the code easier.
Numerous other improvements could be made to aperta-accessum, including the following:
1. Creating a user interface to receive API tokens, any other required parameters, and buttons to run the scripts on click. A user interface would make aperta-accessum more accessible to administrators with no or minimal command line experience. If a user interface were built, it would likely need to be either a desktop or cloud application because bepress does not allow cross-origin resource sharing (i.e., API calls cannot be made to bepress from within the browser of a non-bepress URL).
2. Automating the upload of the XML file containing new articles to bepress.
3. Automating the creation and upload of a CSV file for revision of articles already in bepress.
4. Automating the scripts to run at set intervals (e.g., run "sendEmail.js" bimonthly).
5. Providing better terminal output (e.g., color-coded key words, loading animations).
6. Testing the tool with a larger department or faculty.
Finally, Willinsky points out that there is a convergence between open source, OA, and open science (2005). There are well-known benefits not only for science but also for researchers for using open research practices: increases in citations, media attention, potential collaborators, job opportunities, and funding opportunities (McKiernan et al., 2016). To optimize this convergence, aperta-accessum could be expanded to other areas of open science (Spellman et al., 2017), including the following: data sets (Chen et al., 2018; Kazmi et al., 2021); FOSS (Von Krogh & Von Hippel, 2006; Von Krogh & Spaeth, 2007); free and open-source hardware (FOSH) (Pearce, 2013; Pearce, 2015; Pearce, 2016; Maia Chagas, 2018); and new Elsevier companion journals such as Data in Brief, MethodsX, SoftwareX, and HardwareX.

RESULTS
The open-source code for aperta-accessum was successfully developed for the workflow described in the Methods section. Figure 2 shows the results of the first script of Stage 1 when used on Western University's Electrical and Computer Engineering Department directory page. As Figure 2 shows, 37 people were found on the directory page, eight ORCID iDs were found for these people, 905 identifiers (e.g., DOIs) were found for these ORCID iDs, 280 DOIs were already OA, and 12 DOIs were found and were ready to email. Figure 3 shows an email generated to prompt researchers to upload their articles. Figure 4 shows the custom webpage to which a professor is sent to upload the preprint or accepted manuscript of a specific article, and Figure 5 shows the page displayed after a successful upload. Of the 12 emails sent, 10 responses were received.
The results of the time trials are shown in Tables 1 and 2 for a department and an individual researcher, respectively. The administrator can start running a script and let it complete in the background. The time the administrator spent directly working with aperta-accessum (e.g., starting a script or uploading a file) was classified as administrator time. The total time to complete the stage (i.e., administrator time plus the time to complete any background tasks) was classified as total time. As can be seen, the administrator time invested in running a department is under 3 minutes, and the total time is under 20 minutes.
Table 2. Timed-trial results of stages when using aperta-accessum on a high-productivity researcher
As can be seen in Table 2, the administrator time to run an individual researcher is approximately the same as to run a department; however, because a department has many more papers to process, the total time is less for a researcher (i.e., < 5 min even for high-productivity individuals).

DISCUSSION
The results of the development and testing of aperta-accessum will first be discussed in terms of its performance and ability to make an entire institution's scholarship OA. The best practices to roll out aperta-accessum are outlined. Next, the limitations are detailed, and future work is presented. Finally, the potential long-term impact of widespread adoption of this FOSS tool is discussed.

Performance
Tables 1 and 2 show that administrator time (i.e., the time a person would have to spend using aperta-accessum) is approximately 3 minutes, regardless of the number of papers. This is a substantial improvement over the manual process, which requires about 15 minutes per manuscript.
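The comparison can be made explicit with a short calculation. The two constants below are the approximate figures reported in this article; applying them to the 12 articles that were ready to email in the department trial is an illustrative assumption, and the function names are hypothetical.

```javascript
// Approximate figures from this article.
const MANUAL_MINUTES_PER_MANUSCRIPT = 15; // experienced librarian, one manuscript
const ADMIN_MINUTES_PER_DEPARTMENT = 3;   // administrator time with aperta-accessum

// Departments that can be processed in the time one manual deposit takes.
function departmentsPerManualDeposit() {
  return MANUAL_MINUTES_PER_MANUSCRIPT / ADMIN_MINUTES_PER_DEPARTMENT;
}

// Librarian minutes needed to deposit a given number of articles by hand.
function manualMinutes(articleCount) {
  return articleCount * MANUAL_MINUTES_PER_MANUSCRIPT;
}

console.log(departmentsPerManualDeposit()); // 15 / 3
console.log(manualMinutes(12));             // 12 articles handled manually
```

Under these figures, one manual deposit costs the same administrator time as about five departments, and handling the trial's 12 articles by hand would take about 180 minutes compared with roughly 3 minutes of administrator time. This counts only the administrator's time, not the time authors spend uploading manuscripts.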
The librarian using the tool received an hour-long training session from the student who developed the software. After this training session, the librarian was able to complete all stages of the workflow. The members of the Electrical and Computer Engineering Department were told verbally, by another professor, that they might receive emails asking them to upload research. Of the 10 responses, 2 attached documents instead of using the upload link, but both of these responses were from the same professor. All other respondents uploaded their documents correctly, indicating that they understood the process. There were no false positives or issues with the uploads, but rolling the software out to a larger group would require some adjustments, which are detailed in the following section.

Rollout best practices
Capitalizing on the launch of Western's OA policy will be key in the rollout and success of aperta-accessum. Another important consideration will be to target departments and faculties where OA publishing is the default while simultaneously working with the broader campus community to raise awareness of the benefits of OA and Western's IR, Scholarship@Western. Acknowledging that uptake of OA is often dependent on unique disciplinary publishing cultures (Severin et al., 2018), it will be important to tailor messaging that meets the diverse needs of scholars. As aperta-accessum relies on data from ORCID, parallel efforts to support faculty in populating their ORCID profiles will be critical for maximizing the utility of aperta-accessum. This can be seen in the results from Figure 2, as most faculty currently do not have ORCID profiles. This process will take time and require coordinated communication between campus stakeholders, but the potential payoffs in terms of demonstrating research impact and public accountability are great. Currently, only 39% of Western University's scholarship is OA (COKI, 2022), which puts Western just under the average of the U15, as shown in Table 3. The effectiveness of the rollout can be determined by monitoring this statistic.
The aperta-accessum software is an exciting tool that will complement existing workflows as it is intended to capture previously published research articles, thus making it easier for interested faculty to upload all of their scholarship to the repository.

Potential long-term impact
As aperta-accessum is a completely FOSS tool, it requires that any person or institution that adapts it to their own repository reshare the code to benefit the overall global community, following the tenets of the GNU GPL. There is a need for this to occur because this article only reports on the ability of aperta-accessum to function for the proprietary bepress-based IRs. As outlined in the previous section, there are many types of IRs commonly used. If only a handful of institutions make the relatively minor investment in adapting aperta-accessum to meet their own repository's requirements, it is theoretically possible that the entirety of academic output could be made available through relatively easy, metadata-tagged OA. The political feasibility of this seems reasonably possible, as many academics are calling for unlimited access to the entire peer-reviewed literature (Budapest Open Access Initiative, 2002). There is already considerable evidence that both the number of platinum OA journals (free to read, with no APC for authors) and the number of platinum OA journals with impact factors are growing rapidly (Pearce, 2022). There is also widespread interest in vastly expanding transparency in all aspects of the scientific knowledge-generating process (European Commission, 2015). Furthermore, recent research has shown a clear willingness of academics to expand OA, which would hasten scientific progress while also making science more just and inclusive (Pearce et al., 2022a, 2022b). The results of this article indicate that functionally doing this would not be prohibitively expensive or time consuming for the past literature and would provide a legal means to provide the same level of access that Sci-Hub provides illegally (Bohannon, 2016).
This will likely put economic pressure on the current business models of scientific publishers, which have been heavily criticized for profiteering from predominantly publicly funded research (Eisen, 2003;Monbiot, 2011;Buranyi, 2017). The conflict between green OA (self-archiving) and gold OA (APCs) is not yet resolved (Albert, 2006). A small percentage of all academic articles have been self-archived, but universal online access may be more readily available because of the use of aperta-accessum and the self-interest of scholars to have an easy way to have their work read and cited more often, thereby increasing their prestige. In addition to the rise of platinum OA journals with impact factors that enable academics to gain academic status and fully share OA without charges (Pearce, 2022), aperta-accessum also offers another path to relatively easy OA in journals with impact factors. This will add price pressure on journals as well as the need to find new business models in academic publishing.

CONCLUSION
This study has provided an economical method for universities that have established campus OA repositories to upload millions of manuscripts under numerous license agreements while also linking metadata to make them discoverable. The development and testing of a FOSS JavaScript-based application, aperta-accessum, was described in detail. The results show that aperta-accessum is capable of radical time savings for harvesting OA articles. In the administrative time that it takes to manually make a single document OA, an administrator using aperta-accessum can now process approximately five entire departments' worth of OA articles. This study demonstrated aperta-accessum for a single type of repository. Future work is needed to adapt it for all OA repositories to enable universal OA to the peer-reviewed literature.

ACKNOWLEDGMENTS
This work was supported by the Thompson Endowment.

DISCLOSURES
The authors declare that they do not have any conflict of interest, financial or otherwise.