Investigating missing DOIs

This work is driven by the research questions that have guided it from the very beginning. They concern: the identification of the publishers responsible (because of incorrect metadata sent to Crossref) for the missing citations in COCI; the identification of the publishers to which these invalid citations point (i.e. who published the cited articles); and the number of citations that are now valid among those that were invalid according to our input data.



Introduction

OpenCitations is a significant project in the Open Science domain, in particular with regard to open scholarly citations. Its ability to make a large amount of open citation data available allows the creation of a wide network for the scientific community. COCI is the OpenCitations Index of Crossref open DOI-to-DOI citations. All the information collected by OpenCitations and stored in COCI comes from Crossref. However, not all Crossref citations are transferred to COCI: some are left out deliberately, others because of technical errors, since their DOI is invalid. This is a loss of information not only for OpenCitations but also for the scientific community. Moreover, the publishers of the articles connected to these invalid identifiers are directly affected by this gap in two ways: they may be directly responsible for the errors that invalidate the DOIs of the cited sources, since Crossref does not double-check the information provided by publishers; or they may be the publishers of the articles that are wrongly cited and, because of this, lose all the advantages of being citable (e.g. acknowledgments).

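The purpose of this work derives from the research questions that have guided it from the very beginning. They concern:

  1. which publishers were responsible (due to incorrect metadata sent to Crossref) for the missing citations in COCI;
  2. to which publishers such invalid citations point (i.e. who published the cited articles);
  3. how many of the initially invalid citations are currently valid.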

In particular, among the data obtained and collected in the output file, we selected those best suited to represent our findings and to facilitate their interpretation:

Based on the research questions, and hence on the purpose of the project as defined above, the objective of our work is, first, to identify the main publishers involved in these missing citations and, second, to identify patterns of errors and determine whether they can be attributed to specific activities of some publishers; for instance, whether the presence of these incorrect DOIs is due exclusively to publishers sending incorrect citation metadata for their publications to Crossref. This objective feeds into a broader one: making sure that more data can be integrated into COCI by correcting the publisher activities that lead to these missing citations and invalid data, or devising a different method of collecting citation data from publishers that prevents incorrect citation data from entering platforms such as Crossref.


Which publishers were responsible (due to their incorrect metadata sent to Crossref) for the missing citations in COCI?

To answer this research question, we identified both the names of the publishers responsible for the invalid citations (i.e., the publishers of the valid citing DOIs) and the total number of citing DOIs for each publisher, distinguishing the citations that remained invalid even after the DOI API request from those that became valid. This distinction is useful to show whether, for some publishers more than others, the citations in which they appear as publisher of the citing article were later validated.
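The per-publisher counts described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the function name `tally_citing`, the record layout and the sample publishers are all hypothetical, and each record is assumed to already carry the citing publisher and the outcome of the validity check.

```python
from collections import defaultdict

def tally_citing(records):
    """Count invalid, valid and total citations per citing publisher.

    `records` is assumed to be an iterable of
    (citing_publisher, is_now_valid) pairs.
    """
    counts = defaultdict(lambda: {"invalid": 0, "valid": 0, "total": 0})
    for publisher, is_now_valid in records:
        key = "valid" if is_now_valid else "invalid"
        counts[publisher][key] += 1
        counts[publisher]["total"] += 1
    return dict(counts)

# Hypothetical sample, not project data.
sample = [
    ("Publisher A", False),
    ("Publisher A", True),
    ("Publisher B", False),
    ("Publisher A", False),
]
result = tally_citing(sample)
for name, row in sorted(result.items()):
    print(name, row["invalid"], row["valid"], row["total"])
```

The three counters correspond to the invalid citing, valid citing and total citing figures reported for each publisher.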

Having chosen to retrieve this additional information as well, we found the resulting data particularly interesting.

Table columns: publisher | invalid citing | valid citing | total citing

The following visualization is built upon the twenty most relevant publishers retrieved in our analysis, with respect to both the citations made and the citations received. First, only one publisher has a number of citing DOIs significantly greater than all the others (more than 370,000): Ovid Technologies (Wolters Kluwer Health). Other significant publishers by number of citing DOIs are: Springer Science and Business Media LLC, Association for Computing Machinery (ACM), Informa UK Limited and Wiley. For the remaining publishers, the number of citing DOIs is considerably lower: between approximately 5,000 and 25,000. Second, at least for these selected publishers, the citing DOIs of citations that are now valid (after the DOI API request) are distributed among only a few publishers.

(Hover over the bars to see the exact number of valid/invalid citations.)


To which publishers did such invalid citations point (i.e. who published the cited articles)?

The visualization for this second research question, and the data on which it is built, refer only to the twenty most relevant publishers. The methodology and approach followed for this part of the research are the same as for the first part, while the citation data considered here refer to the publishers of the cited articles.
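Restricting the data to the most relevant publishers can be sketched as below. This is only an illustration under an assumption: "most relevant" is taken to mean "highest total number of citations", and the function name `top_publishers` and the sample totals are hypothetical.

```python
def top_publishers(totals, n=20):
    """Return the n (publisher, total) pairs with the highest totals.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Hypothetical totals, not project data.
totals = {"Publisher A": 5, "Publisher B": 9, "Publisher C": 5}
top_two = top_publishers(totals, n=2)
print(top_two)
```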

Table columns: publisher | invalid cited | valid cited | total cited

The following visualization shows results similar to those for the previous research question. First, Ovid Technologies (Wolters Kluwer Health) is again the publisher with by far the highest number of cited DOIs (almost 380,000, almost triple the number of the second-ranked publisher). Other significant publishers by number of cited DOIs are: Test accounts, Springer Science and Business Media LLC, Wiley and Elsevier BV. The numbers collected for this second research question are slightly smaller: about 380,000 cited DOIs for the first publisher, with most of the others falling between 4,000 and 130,000. Second, in this case too, the cited DOIs validated by the DOI API request are concentrated among a few publishers.

(Hover over the bars to see the exact number of valid/invalid citations.)


How many invalid citations are currently valid?

The aim of this question was to understand in how many cases the invalidity of the DOIs was due not to incorrect information provided by the publisher to Crossref, but to external reasons, e.g. a lack of information about the cited article at the time of the check preceding possible inclusion in COCI. As the results show, and as the graph illustrates, the number of citations validated between the OpenCitations check and the present is very low. Only 7.7% of the citations have been validated, while the rest are still invalid. Thus, even if we cannot say with certainty whether these missing citations are the fault of the publisher or of other external factors, the results clearly show that the reasons that invalidated the citations during the COCI check still persist.
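The share of validated citations is a simple ratio of the now-valid citations to all the initially invalid ones. A minimal sketch follows; the function name `validated_share` is hypothetical, and the counts are illustrative values chosen only to reproduce a 7.7% share, not the project's actual totals.

```python
def validated_share(valid_count, invalid_count):
    """Percentage of initially invalid citations that are now valid."""
    total = valid_count + invalid_count
    if total == 0:
        raise ValueError("no citations to summarize")
    return 100 * valid_count / total

# Illustrative counts only, not project data.
share = validated_share(77, 923)
print(f"{share:.1f}% of the citations are now valid")
```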


(Hover over the arcs to see the total number of valid/invalid citations.)


An interesting case: Self-Citations

While answering the research questions discussed in the previous sections, we noticed a curious case: some publishers that are among the top twenty citing publishers also appear among the top twenty cited publishers in the citations missing from COCI. This led us to investigate further, to find out whether these are cases of self-citation. We therefore reorganized the data we already had to record the following information: which publishers are cited by the first ten publishers, and in how many citations. To represent the resulting data we chose a Sankey diagram. As the chart shows, three cases stand out. Ovid Technologies turns out to be a case of self-citation, as we had supposed. JSTOR also appears as a case of self-citation among the invalid citations. Finally, most of the invalid citations for which the Association for Computing Machinery (ACM) is the responsible publisher point to Test accounts, which is a ghost publisher (i.e., not clearly identified).
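The reorganization described above amounts to counting citations per (citing publisher, cited publisher) pair, which is the form a Sankey diagram needs, and then isolating the pairs where the two publishers coincide. The sketch below is illustrative only: the function names and the sample pairs are hypothetical.

```python
from collections import Counter

def publisher_flows(pairs):
    """Count citations for each (citing_publisher, cited_publisher) pair."""
    return Counter(pairs)

def self_citations(flows):
    """Keep only the flows where a publisher cites itself."""
    return {src: n for (src, dst), n in flows.items() if src == dst}

# Hypothetical sample, not project data.
pairs = [
    ("Publisher A", "Publisher A"),
    ("Publisher A", "Publisher A"),
    ("Publisher A", "Publisher B"),
    ("Publisher C", "Publisher B"),
]
flows = publisher_flows(pairs)
print(self_citations(flows))
```

Each entry of `flows` can be read as one link of the diagram: source, target and link weight.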


Conclusions

In general, our research suggests that certain publishers send a large proportion of the invalid citation data to Crossref, while on the receiving side the largest publishers are the ones most prone to receiving invalid citations. Another point of interest is the relatively large number of invalid citations made to DOIs with invalid prefixes (prefixes not belonging to any publisher in Crossref). We used the umbrella term "unidentified" as the publisher name for all of these DOIs and, as the second visualization shows, this "publisher" is among those with the most citations received. A further point of interest is the publisher name "Test accounts" retrieved from the Crossref API: the name of this hypothetical publisher, the lack of further information about it on Crossref, and the fact that, in all the processed data, it appears only on the receiving side of strictly invalid citation data, all point to the conclusion that such a publisher does not exist. Finally, for a few of the publishers responsible for the most invalid citation data, the amount of invalid citation data received is also very high, which surprisingly suggests that these publishers have been citing themselves with invalid (or at least not yet validated) citation data, as in the case of Ovid Technologies. Our deduction, based on the available sources, is that these invalid self-citations might be caused by the use of internal DOI data not yet available to the public and to Crossref.
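The fallback to the "unidentified" label can be sketched as a lookup on the DOI prefix, i.e. the part of the DOI before the first slash. This is a minimal illustration under assumptions: the function names, the prefix table and the sample DOIs are hypothetical, and the prefix-to-publisher mapping is assumed to have been built beforehand from Crossref data.

```python
def doi_prefix(doi):
    """Return the part of a DOI before the first slash, normalized."""
    return doi.strip().lower().split("/", 1)[0]

def publisher_for(doi, prefix_to_publisher):
    """Look up the publisher by DOI prefix, defaulting to 'unidentified'."""
    return prefix_to_publisher.get(doi_prefix(doi), "unidentified")

# Hypothetical prefix table and DOIs, not project data.
prefix_to_publisher = {"10.1234": "Publisher A"}
print(publisher_for("10.1234/abc.1", prefix_to_publisher))
print(publisher_for("10.9999/xyz", prefix_to_publisher))
```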

Overall, the results have shown three major points:

  1. only a few publishers have a high number of invalid citations, whether as citing or as cited resources, while most of the remaining publishers level off at small values;
  2. only a small part of the previously invalid citations have since become valid, so the issues that characterized the previously invalid DOIs are still present;
  3. some relevant publishers, such as Ovid Technologies and JSTOR, are involved in invalid self-citations.

The results of the project could be applied to the more general research area of COCI and OpenCitations, as well as to other projects that deal with open citation data. They could help integrate more data into COCI by tackling the issue of invalid DOIs at the root: not by correcting the invalid DOIs already collected, but by correcting the publisher activities that lead to these missing citations and invalid data, or by devising a different method of collecting citation data from publishers that prevents incorrect citation data from entering platforms such as Crossref. Further improvements, concerning the recognition and correction of the errors affecting the prefixes, may lead to a broader understanding of the publishers related to the invalid DOIs, by adding to the research all the DOIs that, for this reason, could not be included in the final computations.


About

Alessia Cioffi

Graduated in Classical Literature at the University of Bologna. I'm currently attending the Digital Humanities and Digital Knowledge Master's Degree course at the same University.

Arianna Moretti

Graduated in Anthropology at the University of Bologna. I'm currently attending the Digital Humanities and Digital Knowledge Master's Degree course at the same University.

Nooshin Shahidzadeh Asadi

Graduated in Software Engineering at the University of Tehran. I'm currently attending the Digital Humanities and Digital Knowledge Master's Degree course at the University of Bologna.

Sara Coppini

Graduated in Philosophy at the University of Bologna. I'm currently attending the Digital Humanities and Digital Knowledge Master's Degree course at the same University.