What is CAGE technology, the cap-analysis of gene expression?

CAGE is a method that helps researchers profile RNA expression and identify promoters and other regulatory elements. Among those regulatory elements are enhancers, which are very important: recent studies show that they are key elements in regulating genome output.

How did you become involved in the development of CAGE technology for transcriptome analysis?

Although the genome sequence would provide a lot of useful information, we could foresee at the time that it would be very difficult to identify all the regulatory elements, such as the locations of promoters and enhancers. We were thinking that if we could sequence at very high throughput, we could map the transcription start sites of all the RNAs, and in doing so automatically identify all the promoters and regulatory elements as well. At the time, only Sanger sequencing was available, and it was a fairly expensive way to sequence. We decided to modify our existing protocol, which was based on a technique called cap-trapper, to make it a high-throughput method: we cut out many short sequence fragments, ligated them together, and used the result for Sanger sequencing. With this protocol, we developed the first version of CAGE, which we used extensively in the FANTOM 3 project. We then kept developing the technology to match next-generation sequencing.
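The idea of mapping start sites to find promoters can be illustrated with a small sketch. This is not the FANTOM pipeline. It assumes each CAGE read has already been aligned and reduced to a (chromosome, strand, 5' position) tuple, and it merges nearby start sites into candidate promoter clusters using a hypothetical distance threshold, max_gap, chosen only for illustration.

```python
from collections import Counter, defaultdict


def cluster_tss(tags, max_gap=20):
    """Group CAGE tag 5' ends into clusters (candidate promoters).

    tags: iterable of (chrom, strand, position) tuples, one per read.
    max_gap: hypothetical threshold; start sites at most this many bases
             apart on the same chromosome and strand share a cluster.
    """
    # Count how many tags start at each position, per chromosome and strand.
    by_strand = defaultdict(Counter)
    for chrom, strand, pos in tags:
        by_strand[(chrom, strand)][pos] += 1

    clusters = []
    for (chrom, strand), counts in sorted(by_strand.items()):
        current = []
        for pos in sorted(counts):
            # A gap larger than max_gap closes the current cluster.
            if current and pos - current[-1] > max_gap:
                clusters.append(_summarize(chrom, strand, current, counts))
                current = []
            current.append(pos)
        if current:
            clusters.append(_summarize(chrom, strand, current, counts))
    return clusters


def _summarize(chrom, strand, positions, counts):
    """Describe one cluster: its span, tag count, and most-used start site."""
    return {
        "chrom": chrom,
        "strand": strand,
        "start": positions[0],
        "end": positions[-1],
        "tag_count": sum(counts[p] for p in positions),
        "dominant_tss": max(positions, key=lambda p: counts[p]),
    }


# Toy input: three tags near position 100, one far away, one on the other strand.
example_tags = [
    ("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "+", 105),
    ("chr1", "+", 500), ("chr1", "-", 100),
]
example_clusters = cluster_tss(example_tags)
```

The tag count of each cluster serves as the expression measurement for that promoter, which is why a single experiment yields both expression levels and promoter locations.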

What are the key advantages of choosing CAGE over other methods?

The key advantage is that we not only profile RNA expression but also identify non-coding RNAs. Non-coding RNAs are a relatively new class of RNAs that do not make proteins. Using CAGE, we can identify messenger RNAs and non-coding RNAs at the same time. For all of these RNAs, we also identify promoters and regulatory elements. Identifying promoters is very important because we can bioinformatically analyze a promoter’s sequence and identify shorter sequences within it called transcription factor binding sites. Those sites are responsible for recruiting and binding proteins called transcription factors, which are very important for regulating gene activity. In essence, in one single CAGE experiment, we can analyze the whole genome. For all promoters, we can bioinformatically infer transcription factor binding sites and promoter networks. We not only measure gene expression, but we can also identify the network that regulates it. This provides much more information, at a much higher level, than expression analysis alone.
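To make the binding-site step concrete, here is a minimal sketch of scanning a promoter sequence for matches to a consensus motif on both strands. It is an illustration under stated assumptions, not the method used in FANTOM: the consensus string, the toy promoter sequence, and the function names are all hypothetical, and real analyses typically use position weight matrices rather than exact consensus matching.

```python
import re

# IUPAC nucleotide codes translated to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMN", "TGCAYRSWMKN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA or IUPAC string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, consensus):
    """Return sorted (position, strand) matches of a consensus motif.

    Positions are 0-based and refer to the leftmost base of the match on
    the given (forward) sequence, for both "+" and "-" strand matches.
    """
    sequence = sequence.upper()
    hits = set()
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        # A lookahead lets overlapping matches be reported as well.
        pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in motif) + "))")
        for match in pattern.finditer(sequence):
            hits.add((match.start(), strand))
    return sorted(hits)


# Hypothetical promoter fragment and a TATA-box-like consensus (illustrative only).
toy_promoter = "GGCTATAAAAGGCC"
toy_hits = find_sites(toy_promoter, "TATAWAW")
```

Counting such matches for every motif in every promoter gives a promoter-by-motif table, which is the kind of input a network inference step would need.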

What kinds of research have you done using CAGE analysis? Has it helped improve the results of your research by being faster and more in-depth than other methods?

It has provided a different type of data that we would not have had without CAGE. We have been using CAGE in the FANTOM project. FANTOM stands for Functional Annotation of the Mammalian Genome. It is a very long-running project that started in 2000 and has gone through stages 1 to 6. We started to use CAGE in FANTOM 3, where we made the first map of promoters for mouse and human genes. Using CAGE, we also identified non-coding RNAs, in particular antisense RNAs. Many RNAs have antisense transcription, and CAGE was instrumental in identifying those RNAs. After that first promoter map, we continued to use CAGE extensively in the FANTOM 4 project. In FANTOM 4, we introduced an analytical tool called motif activity response analysis. This method associates promoter activity with transcription factor binding sites in order to infer the network of transcription factors responsible for transcribing genes. This work was published in Nature Genetics in 2009.
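The core idea of relating promoter activity to binding sites can be sketched as a linear model: the expression of each promoter in each sample is approximated as the sum, over motifs, of the number of binding sites for that motif times a per-sample motif activity. The code below is a simplified least-squares illustration of that idea, assuming NumPy is available. It is not the published implementation, and the function name, the centring step, and the toy numbers are assumptions made for this example.

```python
import numpy as np


def infer_motif_activities(site_counts, expression):
    """Estimate motif activities by ordinary least squares.

    site_counts: (promoters x motifs) array of predicted binding-site counts.
    expression:  (promoters x samples) array of promoter expression values,
                 for example log-scaled CAGE tag counts.
    Returns a (motifs x samples) array A such that site_counts @ A
    approximates the expression after centring each promoter across samples.
    """
    N = np.asarray(site_counts, dtype=float)
    E = np.asarray(expression, dtype=float)
    # Centring per promoter is this sketch's choice: activities then describe
    # changes between samples rather than absolute expression levels.
    E_centred = E - E.mean(axis=1, keepdims=True)
    activities, _, _, _ = np.linalg.lstsq(N, E_centred, rcond=None)
    return activities


# Toy data: 4 promoters, 2 motifs, 2 samples (all values hypothetical).
toy_sites = np.array([[1, 0], [0, 1], [1, 1], [2, 0]])
toy_true_activities = np.array([[1.0, -1.0], [-0.5, 0.5]])
toy_expression = toy_sites @ toy_true_activities
toy_estimate = infer_motif_activities(toy_sites, toy_expression)
```

In this toy case the expression was generated from known activities, so the estimate recovers them. With real data the fit is only approximate, and some form of regularization would normally be needed.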

Another interesting paper from the FANTOM 4 project reported the expression of retrotransposon elements. This was possible because we used CAGE sequencing, which showed that retrotransposon elements, often considered the junk of the genome, were not just sleeping and being repressed but were actively transcribed at very specific stages. For example, one family of them is very important and active during embryonic development in both mouse and human. This was clearly identified by using CAGE.

We started the FANTOM 5 project in 2010, and it is still running. In FANTOM 5, we made a second, very large map of promoter elements. It has been extensively validated across many sample types, including primary cells, developmental stages, and diseases such as cancer. We have mapped many regulatory elements in important biological samples, particularly disease samples. This is the broadest reference of transcription ever made.

The second important achievement of FANTOM 5 is that we identified so-called enhancer RNAs. Enhancers are long-range regulatory elements present in the human and mouse genomes. They also produce RNAs, most of which are non-coding RNAs that don’t make any protein, and some of them have regulatory functions. We have discovered something like 55,000 of them. With this information, we can identify enhancers for specific stages. For example, we can find enhancers specific to cancer or some other disease, or an enhancer that is active in the normal state. In this way, we can study biology much better.

We have also found enhancers that are important for stem cells or iPS cells. We can use this type of information to produce better cells for cell therapy, where cells are created in vitro and transferred into patients. For diseases such as cancer, we can monitor specific enhancer RNAs that are usually highly expressed in cancer. This may be used in future clinical studies to identify and characterize targets much better.

Microarrays, full-length RNA-seq and CAGE sequencing all perform genome-wide gene expression analysis. How do researchers choose between these technologies when they need transcriptome analysis?

Microarrays were the first high-throughput method for expression analysis and have been used since the late 1990s. Microarrays can measure the expression of gene isoforms: some are targeted to multiple variants and splicing isoforms, but those are not necessarily the most widely used. The problem with microarrays is that they have a fairly narrow dynamic range, so they cannot capture RNAs with very low expression. They also sometimes show cross-hybridization, so similar genes may give signals in the wrong place. Often the data is not easy to interpret.

The most commonly used technique is RNA sequencing. In principle, this is a very good method, and it is broadly used. However, even though RNA sequencing captures a lot of splicing forms, it is still difficult to resolve everything into separate gene models, because the data is a large mixture in which reads from different isoforms are convoluted. For this reason, I think RNA-seq is not achieving its highest potential, particularly with current next-generation sequencing, which uses short reads.

At RIKEN, we focus on CAGE because we can achieve higher resolution even when we sequence less deeply, so our sequencing is much cheaper than RNA sequencing. RNA sequencing requires hundreds of millions of reads, which starts to become fairly expensive. With CAGE, we get equivalent coverage of genes and promoters by sequencing only ten million reads, so we can save quite a bit on sequencing costs. Here in the lab, we usually use CAGE by default when we want to study the differentiation of cells and see how activity changes at the promoter level. We also do a little RNA sequencing because it helps us figure out the types of RNAs that are expressed, but we don’t believe that RNA sequencing can help us deeply understand all the RNA isoforms. Future studies and new technology will be needed to understand the expression of different RNA variants.

Are there any aspects of CAGE that need improvement?

As with any technology, we always have to work to make it better. Sequencing instruments keep evolving to provide higher throughput, and CAGE has to match them. That means we need multiplexing, which is preparing many samples in parallel in a single batch and then analyzing them together. We also need to ensure that CAGE works equally well anywhere in the world so that data can be compared. In short, we need standardization, simplification of the protocol, and multiplexing.

We also need to scale down the amount of sample used, so miniaturization is a key word for the future development of CAGE and similar technologies.

What are your expectations for CAGE’s users in the future?

If the technology is reproducible and can be distributed broadly, together with good bioinformatics, I expect it will be used as a routine way to analyze what the transcriptome is doing. I believe we will have a growing number of databases and bioinformatics tools that handle all the data and help us standardize. One of the challenges is that if we bring this technology to hospitals in the future, we will need to provide simple explanations for a lot of data. The challenge for analysis is to standardize and to provide an interpretation that does not require a Ph.D. in bioinformatics, only a basic background in life science. I think this kind of simplification has to happen for all of these technologies.