1 Introduction

Previous work has shown that clustered regularly interspaced short palindromic repeats (CRISPRs) are widespread in bacterial and archaeal genomes. CRISPR loci are usually composed of short, highly conserved repeats of 21–48 bp separated by non-repetitive spacer sequences of 26–72 bp. CRISPR identifies the target gene through these spacer sequences, and CRISPR-associated (Cas) protein families promote cleavage of the target genes. Together they form an immune defense system in prokaryotic cells that protects against invasion by foreign genetic material and maintains genome stability and integrity. In recent years, the CRISPR/Cas9 system has been developed remarkably quickly into a set of tools for cell and molecular biology research, likely owing to its simplicity, high efficiency, and versatility. The system is by far the most user-friendly tool for gene editing. It is therefore widely used for the direct genetic transformation of human cells, as well as of mouse, zebrafish, rice, etc. Furthermore, the system shows broad prospects for clinical application in the treatment of acute lymphocytic leukemia, hereditary vision loss, acquired immune deficiency syndrome, etc., which has made it the most popular technology in the field of gene editing. CRISPR has appeared in Science’s Breakthrough of the Year three times, in 2012, 2013, and 2015. The first two times it was a runner-up together with other gene editing techniques; in 2015, however, it broke away from the others on the strength of its spectacular achievements. To understand the path of knowledge evolution and the dynamic process and characteristics of CRISPR/Cas9 research, this paper uses the information visualization software CiteSpaceV to analyze the recently published literature on CRISPR/Cas9 and to draw knowledge maps.

2 Methods

2.1 Data retrieval

We retrieved research papers in the field of gene editing from the Web of Science (WOS) database by keyword, using the search formula TS (Topic)=CRISPR/Cas9 or CRISPR or CAS9. The retrieval period was 2002 to 2015 (CRISPR was officially named in 2002), the publication types were limited to Article and Review, and the language was limited to English. After applying the built-in deduplication function of CiteSpaceV, we obtained 1908 papers. Each record contains 17 fields: PT (publication type), AU (authors), TI (document title), SO (publication name), DE (author keywords), ID (keywords plus®), AB (abstract), C1 (author address), CR (cited references), NR (cited reference count), TC (times cited), PY (year published), VL (volume), PG (page count), SC (subject category), UT (unique article identifier), and ER (end of record). Together, these records serve as the formatted input file.
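The record structure described above can be illustrated with a minimal sketch. It assumes the common WOS tagged plain-text layout (a two-letter field tag at the start of a line, continuation lines indented, ER closing each record). The function names `parse_wos_records` and `deduplicate` and the sample records are hypothetical, and the deduplication by UT is only a stand-in for the built-in CiteSpaceV function, whose internals are not described here.

```python
from collections import defaultdict

def parse_wos_records(text):
    """Parse tagged plain text into a list of dicts (field tag -> list of values)."""
    records = []
    current = defaultdict(list)
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        prefix, value = line[:2], line[3:].strip()
        if prefix == "ER":
            # ER marks the end of a record
            records.append(dict(current))
            current = defaultdict(list)
            tag = None
        elif prefix.strip():
            # a new field tag such as AU, TI, PY
            tag = prefix
            current[tag].append(value)
        elif tag is not None:
            # indented continuation line belongs to the previous tag
            current[tag].append(value)
    return records

def deduplicate(records):
    """Keep the first record for each UT (unique article identifier)."""
    seen = set()
    unique = []
    for record in records:
        key = record.get("UT", [None])[0]
        if key is None or key not in seen:
            unique.append(record)
        if key is not None:
            seen.add(key)
    return unique

# Hypothetical sample data, not taken from the retrieved data set.
SAMPLE = """PT J
AU Author, A
   Author, B
TI A hypothetical title
PY 2013
UT WOS:HYPOTHETICAL0001
ER

PT J
AU Author, C
TI Another hypothetical title
PY 2014
UT WOS:HYPOTHETICAL0002
ER

PT J
AU Author, A
   Author, B
TI A hypothetical title
PY 2013
UT WOS:HYPOTHETICAL0001
ER
"""

records = parse_wos_records(SAMPLE)
unique_records = deduplicate(records)
print(len(records), len(unique_records))
```

Real export files also carry file-level header and footer lines, which this simplified parser does not treat specially.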

2.2 Analysis methods

CiteSpace is visual information analysis software based on literature co-citation analysis theory and pathfinder network scaling. It was developed by Prof. Chao-mei CHEN on the Java platform (CiteSpaceV, Version 4.4.R1 (32-bit); http://cluster.cis.drexel.edu/~cchen/citespace) (Chen, 2004; 2006).

CiteSpaceV presents the citation history of a paper as the tree rings of a node. The color of a citation ring denotes the time of the corresponding citations. The radius of the node indicates the number of total citations. The area of circles represents the number of publications. The thin lines show the cooperative relationships between countries or institutions. Cluster maps and time-zone views can reveal the development trend of a discipline or field of knowledge in a specific period, which can be used to present the evolution of the research fronts and hotspots of related fields (Chen, 2012; Chen and Leydesdorff, 2014).

CiteSpace adopts several structural metrics of co-citation networks, including betweenness centrality, modularity, and silhouette (Chen et al., 2010). The betweenness centrality metric measures the extent to which a node lies on the paths that connect other nodes in the network (Brandes, 2001; Freeman, 1977). High betweenness centrality values identify potentially pivotal scientific literature (Chen, 2005). The modularity (Q) measures the extent to which a network can be divided into independent communities or modules. The modularity score ranges from 0 to 1, and usually Q>0.3 means that the community structure is significant (Newman, 2006). The mean silhouette (S) metric shows which nodes lie well within their clusters and which lie somewhere in between clusters; generally S>0.5 means that the clustering is considered reasonable (Rousseeuw, 1987).
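To make two of these metrics concrete, the following is a minimal sketch of normalized betweenness centrality (using the shortest-path accumulation of Brandes, 2001) and of modularity Q for a given partition of an undirected, unweighted network. The toy graph, the partition, and the function names are hypothetical, and the normalization used by CiteSpace itself may differ; the silhouette metric is omitted for brevity.

```python
from collections import deque

def betweenness_centrality(adj):
    """Normalized betweenness for an undirected, unweighted graph.

    adj maps each node to a list of its neighbours.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate pair dependencies in order of decreasing distance
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # every unordered pair was counted twice, so divide by (n-1)(n-2)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: bc[v] * scale for v in bc}

def modularity(adj, communities):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for community in communities:
        members = set(community)
        internal = sum(1 for v in members for w in adj[v] if w in members) / 2
        degree = sum(len(adj[v]) for v in members)
        q += internal / m - (degree / (2 * m)) ** 2
    return q

# Hypothetical toy network: two triangles joined by the bridge c-d.
TOY = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

centrality = betweenness_centrality(TOY)
q_value = modularity(TOY, [["a", "b", "c"], ["d", "e", "f"]])
print(centrality["c"], q_value)
```

In this toy network, the two bridge nodes c and d carry every shortest path between the triangles and therefore receive the highest centrality, which is the sense in which a high-centrality reference connects otherwise separate groups of literature.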

Cluster labels are chosen from the noun phrases and index terms of the citing articles of each cluster. These terms are ranked by the term frequency-inverse document frequency (TF-IDF) algorithm, which evaluates the importance of a word in an article. Labels selected by TF-IDF weighting tend to represent the most salient aspect of a cluster and emphasize the research mainstream (Chen et al., 2010).
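The ranking idea can be sketched as follows, assuming the basic weighting tf × log(N/df) and treating the pooled terms of each cluster as one document. The exact TF-IDF variant and the noun-phrase extraction used by CiteSpace are not specified here, so the function names and the term lists below are hypothetical illustrations.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one dict (term -> weight) per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def top_label(term_weights):
    """Return the term with the largest TF-IDF weight."""
    return max(term_weights, key=term_weights.get)

# Hypothetical term lists, one per cluster of citing articles.
CLUSTER_TERMS = [
    ["spacer", "spacer", "phage", "crispr"],
    ["rna", "immunity", "crispr", "rna"],
    ["cas9", "genome", "editing", "crispr"],
]

cluster_weights = tf_idf(CLUSTER_TERMS)
labels = [top_label(w) for w in cluster_weights]
print(labels[0], labels[1])
```

A term that occurs in every cluster (here "crispr") receives a weight of zero, which is why TF-IDF labels favor the terms that distinguish one cluster from the others.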

3 Results and analysis

3.1 Brief history of CRISPR research

The story of the CRISPR/Cas9 system began with the discovery of a cluster of highly homologous sequences in the Escherichia coli genome (Ishino et al., 1987; Nakata et al., 1989). In 2002, Jansen and colleagues from Utrecht University (Utrecht, the Netherlands) officially named these sequences Clustered Regularly Interspaced Short Palindromic Repeats and created the acronym CRISPR. Furthermore, CRISPR-associated (cas) genes (cas1, cas2, cas3, cas4) were found to be well conserved and usually adjacent to the repeat elements. However, the biological function of CRISPR was still unclear (Mojica et al., 2000; Jansen et al., 2002).

The incubation period lasted from 2002 to 2006, when our understanding of CRISPR and its associated genes was refined. Even though the annual number of research papers was small, some remarkable results appeared in 2005. Bioinformatics analysis revealed an intriguing picture, in which the CRISPR loci of three different types of CRISPR systems (types I–III) might use their constituent spacers to defend specifically against invading foreign genetic elements, such as bacteriophages and plasmids. In addition, some families of cas genes are extremely conserved. These findings suggest that CRISPR sequences may be associated with bacterial immunity (Bolotin et al., 2005; Haft et al., 2005; Mojica et al., 2005).

The germination period lasted from 2007 to 2011, when a series of elegant experiments provided fresh insights into the functional mechanism of CRISPR. The significant findings include the following. Barrangou et al. (2007), from the food ingredients company Danisco in Copenhagen, Denmark, found that bacteria integrated new sequences derived from phage genomic sequences. CRISPR, together with the associated cas genes, provided immunity against phages, and the specificity of resistance was determined by spacer-phage sequence similarity. After acquiring Danisco, DuPont used this technique to breed more resistant strains for industrial production. Marraffini and Sontheimer (2008) found that horizontal gene transfer (HGT) in bacteria and archaea happens through phage transduction or conjugation, and that CRISPR loci can limit the spread of antibiotic resistance by counteracting multiple channels of HGT. This research on CRISPR/Cas function aroused further interest in the study of its mechanism, and the paper has been cited more than 300 times. Brouns et al. (2008) first proved that the guide RNA in the CRISPR system is the critical factor for bacterial immunity to viruses, and that the CRISPR-associated complex plays the key role in antiviral defense.

The years 2012–2013 were a milestone in the progress of CRISPR/Cas9 research. Representative results are as follows. Jinek et al. (2012) published a highlight article in Science, with total cites of 661 and a betweenness centrality of 0.45. The paper revealed a family of endonucleases that use dual RNAs for site-specific DNA cleavage and pointed to the potential for developing an RNA-dependent gene editing system. Mali et al. (2013) reported that they had successfully engineered the type II bacterial CRISPR system to function with custom guide RNA (gRNA) in human cells. The results demonstrated the possibility of CRISPR-mediated gene targeting for RNA-guided, powerful, and multipurpose mammalian genome editing. This paper has total cites of 669, the second highest. Cong et al. (2013) concluded that short RNAs can guide Cas9 nucleases to trigger precise cleavage at endogenous genomic loci in mammalian cells. Multiple guide sequences can be encoded into a single CRISPR array to enable simultaneous editing of several target sites within the mammalian genome, demonstrating the easy programmability and wide applicability of the technology. This paper has the highest total cites (714) in the field.

So far, scientists have successfully applied the technology to diverse species, including human cells, animals, and plants, demonstrating the simplicity and efficiency of the CRISPR/Cas9 system. The power of CRISPR systems will undoubtedly transform life science research and spur the development of novel cell genetics and molecular biology. Fig. 1 shows the annual number of research papers in the field of CRISPR/Cas9 collected by the WOS database from 2002 to 2015 (up to December 2015).

Fig. 1 Number of CRISPR/Cas9 research papers (Data source: Web of Science, up to December 2015)

3.2 Analysis of countries’ (districts’) and institutions’ cooperation

By drawing a hybrid network map of cooperative relationships, we analyze the cooperation between countries (districts) and institutions, their research strength, and other information. We set the node types to Country and Institution and, after clustering, obtain a mixed network map consisting of 147 nodes and 197 edges (a density of 0.0184), as shown in Fig. 2. The United States has by far the highest number of publications, accounting for more than half of the total, with total cites of 3098. China ranks second (691), followed by Germany (434), Japan (381), and France (346). Among research institutions, the University of California, Berkeley, ranks first with a citation number of 250, followed by Harvard University (246), Massachusetts Institute of Technology (MIT) (181), and the Chinese Academy of Sciences (128).
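The reported density is consistent with the usual definition for an undirected network, density = 2E / (N(N − 1)), where N is the number of nodes and E the number of edges. The short check below assumes that definition; the function name is illustrative only.

```python
def network_density(n_nodes, n_edges):
    """Density of an undirected simple graph: actual edges over possible edges."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# Node and edge counts of the hybrid country/institution network.
hybrid_density = network_density(147, 197)
print(round(hybrid_density, 4))
```

A density this low means that fewer than 2% of all possible country and institution pairs are directly linked, so cooperation is concentrated around a small number of hubs.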

Fig. 2 National and institutional hybrid network map

Most countries have cooperative relationships with the United States, which is at the core of this field. The centrality score of the United States is 0.64, much higher than those of other countries; it is followed by France (0.28), England (0.24), and Canada (0.21). The centrality of China is 0.09, indicating that China’s international influence in this field is relatively weak. Although the United States is a research powerhouse in gene editing technology, its output is distributed among various universities. There are six American universities among the Top 10 research institutions, including Harvard University, University of California (UC) Berkeley, MIT, and University of Georgia. Chinese research efforts are concentrated in the Chinese Academy of Sciences, which has the highest centrality among the institutions, as shown in Table 1.

Table 1 A partial list of the countries and institutions in the hybrid network map (Top 10 citations)

3.3 Statistical analysis of main authors

We retrieved the main authors in the field of gene editing from the WOS database and ranked the Top 20 authors by number of research papers, as shown in Table 2. K.S. Makarova from the National Center for Biotechnology Information of the National Institutes of Health (NIH) and R.M. Terns from University of Georgia share the highest betweenness centrality of 0.17. Through a comparative genomics study, Makarova et al. (2006) found that CRISPR sequences might be associated with the immune protection mechanisms of bacteria, which was an important turning point in gene editing research. Hale et al. (2009) mainly studied a CRISPR/Cas effector complex that is composed of small invader-targeting RNAs from the CRISPR loci and the related Cas proteins, and that cleaves complementary target RNAs at a fixed distance. The results indicated that prokaryotes possess a unique RNA silencing system that functions by homology-dependent cleavage of heterogenous RNAs. R. Barrangou from North Carolina State University has the most papers (54), with total cites of 4978, a field H index of 25, and a betweenness centrality of 0.06. His main contribution was to verify that the adaptive immunity of bacteria is related to spacer sequences, which laid the foundation for further study (Barrangou et al., 2007). F. Zhang from the Broad Institute of MIT and Harvard published 31 papers and has a field H index of 21; his total cites (6041), highest cites for a single paper (1629), and average cites per paper (194.87) are all the highest. F. Zhang first applied CRISPR technology to the specific removal of two genes in human cells (Cong et al., 2013) and opened the way for the application of gene editing technology in eukaryotic cells.
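The author indicators used above can be sketched as follows, assuming the standard definition of the H index (the largest h such that h papers each have at least h citations) applied to the papers within the retrieved set. The per-paper citation counts in the example are hypothetical, and only the totals used in the last line (6041 cites over 31 papers) come from the text.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def average_cites(total_cites, n_papers):
    """Mean number of citations per paper."""
    return total_cites / n_papers

# Hypothetical per-paper citation counts for one author.
example_h = h_index([10, 8, 5, 4, 3])

# Totals reported in the text for F. Zhang.
zhang_average = average_cites(6041, 31)
print(example_h, round(zhang_average, 2))
```

Because the H index depends on how citations are spread across papers, an author with fewer papers but a very highly cited one can lead in average cites while ranking lower by H index.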

Table 2 Statistical analysis of main authors in gene editing research field

4 Research frontier analysis

Document co-citation analysis (DCA) is among the most commonly used methods in quantitative studies of science (Chen, 2004; 2006). We set the network nodes to Cited References and selected the Top 35 most cited or occurring items from each time slice, with the other parameters the same as above, and obtained an automatic cluster map. We then extracted labels from the citing literature of each citation cluster to characterize the research fronts in the field of gene editing technology. The resulting co-citation map comprises 11 co-citation clusters with 162 nodes and 205 links, with a Q value of 0.7506 and an S value of 0.5644, indicating that the clustering structure is significant and that its efficiency and reliability are within reasonable limits, as shown in Fig. 3.
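The two steps described above, selecting the most cited references in each time slice and linking references that are cited together, can be sketched as follows. This is a simplified illustration under the assumption that two references are linked whenever one paper cites both. The selection and link-scaling procedures of CiteSpace itself are more elaborate, and the function names and toy papers are hypothetical.

```python
from collections import Counter
from itertools import combinations

def top_cited_per_slice(papers, top_n):
    """papers: list of (year, [cited references]).

    Returns {year: set of the top_n most cited references in that year}.
    """
    by_year = {}
    for year, refs in papers:
        by_year.setdefault(year, Counter()).update(set(refs))
    return {
        year: {ref for ref, _ in counts.most_common(top_n)}
        for year, counts in by_year.items()
    }

def cocitation_counts(papers, keep):
    """Count how often each pair of retained references is cited by the same paper."""
    pairs = Counter()
    for year, refs in papers:
        selected = sorted(set(refs) & keep[year])
        pairs.update(combinations(selected, 2))
    return pairs

# Hypothetical citing papers: (publication year, cited references).
TOY_PAPERS = [
    (2013, ["R1", "R2", "R3"]),
    (2013, ["R1", "R2"]),
    (2014, ["R2", "R3"]),
    (2014, ["R2", "R3", "R4"]),
]

kept = top_cited_per_slice(TOY_PAPERS, top_n=2)
links = cocitation_counts(TOY_PAPERS, kept)
print(dict(links))
```

The resulting pair counts are the weighted edges of a co-citation network, on which clustering and the structural metrics are then computed.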

Fig. 3 An overview of the co-citation networks

The network is divided into 11 co-citation clusters, which are labeled by index terms from their own citers. The four largest clusters in the co-cited literature cluster map are summarized in Table 3. The largest cluster (#0) has 26 members and a silhouette value of 0.796. It is labeled as experimental definition | new spacers by TF-IDF. The CRISPR system is a bacterial acquired immune system. After the first phage infection, bacteria obtain a phage DNA fragment and integrate it into their own genome as a form of memory. Stern et al. (2010) published a research paper in Trends in Genetics that cites 69% of the cited literature in the cluster. Studies show that CRISPR is the key adaptive defense system by which bacterial and archaeal genomes resist phage infections.

Table 3 Main clusters and label cluster information of the co-cited literature map

The second largest cluster (#1) has 21 members and a silhouette value of 0.772. It is labeled as RNA-directed adaptive immunity by TF-IDF. Karginov and Hannon (2010) published a review paper in Molecular Cell that cites 67% of the cited literature in the cluster. This paper summarized the CRISPR/Cas system and its mechanism in bacteria and archaea, and compared it with the gene silencing mechanism mediated by small RNA molecules in eukaryotic cells.

The third largest cluster (#2) has 18 members and a silhouette value of 0.984. It is labeled as development | protein families by TF-IDF. The citing paper that covers the most literature in this cluster is a proteomic study of the structure and function of psychrophilic methanogens (Saunders et al., 2005). It pointed out that the genomes of psychrophilic methanogens contain a similar CRISPR locus, which may be related to bidirectional gene transfer between bacteria and archaea.

The fourth largest cluster (#3) has 16 members and a silhouette value of 0.937. It is labeled as protospacer | reversion by TF-IDF. The paper published in Science in 2012, which clarified the mechanism of the CRISPR/Cas9 system, is an important breakthrough in this field (Jinek et al., 2012). Cas9 is a 160-kD protein that can be used for RNA-guided recognition and cleavage of target DNA. The paper published in Genes & Development by Malina et al. (2013) cites 56% of the literature in the cluster. In this paper, the researchers used CRISPR/Cas9 to target and cut the p53 gene in order to demonstrate its potential use in the construction of mouse cancer models.

5 Summary and perspectives

The research history of the CRISPR/Cas9 system amply shows the importance of interdisciplinarity. The intersections between disciplines are usually new points of scientific growth and new scientific fronts, and they are the most likely to lead to significant scientific breakthroughs. CRISPR research draws on microbiology, food science, bioinformatics, comparative genomics, clinical medicine, etc. Bioinformatics studies in particular showed that CRISPR sequences are complementary to phage DNA sequences, which suggested a relation to bacterial immune protection and pointed out the direction for subsequent functional studies. Researchers will go on to carry out extensive studies in a variety of model organisms, especially clinical research related to human cells and major diseases, making CRISPR one of the most popular methods in biotechnology and medicine.

The gene editing technology of CRISPR/Cas9 needs to be studied in greater depth. Although our understanding of the CRISPR/Cas9 system has grown explosively in the past few years, many issues remain unresolved. Using co-cited literature analysis, we identified the research fronts in the field, and the largest cluster concerns the mechanism of CRISPR/Cas9. Several scientific issues are worthy of further study. Why does the CRISPR locus exist only in some bacteria, and how do bacteria and bacteriophages co-evolve? Given the more serious off-target effects seen in gene editing of mammalian cells, what is the mechanism by which CRISPR/Cas9 accurately identifies and cuts heterologous genetic material in bacterial cells? What is the mechanism that precisely regulates the expression of Cas9 protein and assembles it with single-guide RNA (sgRNA) into a functional complex?

Proposing significant, original scientific issues remains the dilemma we face. China ranks second in the number of papers published in the field of gene editing, but it has a betweenness centrality of only 0.09 and is far behind in the number of papers per author and in highly cited papers, indicating that Chinese research forces in this field are dispersed and have little influence. Many scientists are good at tracking international hot spots, following the scientific ideas of others, and extending work in accordance with existing theories. However, because of a lack of original thought, they rarely propose significant scientific issues, and thus it is difficult for them to make influential achievements. As the physicist Albert Einstein said, “The mere formulation of a problem is far more often essential than its solution, which may be merely a matter of mathematical or experimental skill.”

Compliance with ethics guidelines

Quan-sheng DU, Jie CUI, Chun-jie ZHANG, and Ke HE declare that they have no conflict of interest.

This article does not contain any studies with human or animal subjects performed by any of the authors.