CCRR: Complex Chromosomal Rearrangements Resolver

0. Introduction

Complex Chromosomal Rearrangements Resolver (CCRR) is a self-contained toolkit that turns whole-genome sequencing data into an annotated catalog of complex tumor rearrangements. Packaged in a single Docker image, it installs all dependencies automatically, including six SV callers, five CNV callers, tools for purity-ploidy estimation, and a panel of complex event detectors (ShatterSeek, CTLPScanner, SeismicAmplification, AmpliconArchitect, Starfish, and gGnome via JaBba). A single command runs the full pipeline on tumor/normal BAM files, merges the results into high-confidence consensus SVs and copy number states, infers purity and ploidy, applies the complex event detection tools, and generates publication-ready Circos and track plots.

For users who already have SV and CNV results, a companion web server (https://www.ccrr.life/) provides an interactive interface that supports one-click execution of the same event detection suite. It accepts standard VCF and segment files as input, and also allows custom JSON-formatted SV/CNV data for flexible analysis and visualization. Users receive an interactive dashboard and downloadable result summaries and figures.

The source code is freely available at https://github.com/laslk/CCRR, and the workflow runs seamlessly on any Linux or Windows host with Docker support.

1. Run the Full Pipeline Locally

1.1 Installation
Download
wget -O ccrr1.2.zip https://www.ccrr.life/download_file/ccrr1.2.zip
unzip -q ccrr1.2.zip
Prepare for installation using install.py
python install.py -sequenza -manta -delly -svaba -gridss \
                    -lumpy -soreca -purple -sclust -cnvkit \
                    -ref 'hg19&hg38'

This script will automatically download the dependency data for the tools you selected and build the Dockerfile.

  -sequenza             use sequenza for cn, cellularity and ploidy

  -manta                use manta for sv
  -delly                use delly for sv and cn
  -svaba                use svaba for sv
  -gridss               use gridss for sv
  -lumpy                use lumpy for sv
  -soreca               use soreca for sv
  -purple               use purple for cn
  -sclust               use sclust for cn
  -cnvkit               use cnvkit for cn

  -ref                  hg19 or 'hg19&hg38'

You should select at least one SV tool and one CN tool. For the fastest run, you can use only Delly to obtain both SV and CN.

Obtain licenses for Mosek and Gurobi

Gurobi: Apply for a WLS Compute Server license and store it in the same directory as the Dockerfile, named gurobi.lic. For more information, visit www.gurobi.com
Mosek: Obtain a Mosek license and store it in the same directory as the Dockerfile, named mosek.lic. For more information, visit www.mosek.com

Build Docker image
docker build --pull --rm --build-arg GITHUB_PAT=[GITHUB_PAT] \
        --build-arg SCRIPT_DIR="/home/0.script" --build-arg TOOL_DIR="/home/1.tools" \
        --build-arg DATABASE_DIR="/home/2.share" --build-arg WORK_DIR="/home/3.wd" \
        -f Dockerfile -t ccrr:v1.2 .

To ensure the installation proceeds correctly, you need to provide a GitHub personal access token via GITHUB_PAT.
You can specify these four parameters: SCRIPT_DIR for the script directory (default /home/0.script); TOOL_DIR for the tool directory (default /home/1.tools); DATABASE_DIR for the database and mounted shared directory (default /home/2.share); and WORK_DIR for the work directory (default /home/3.wd).

Run a container
docker run -v $(pwd)/share:/home/2.share -v $(pwd)/wd:/home/3.wd -d -it --name ccrr ccrr:v1.2
docker exec -it ccrr /bin/bash

This command mounts the current directory's share and wd folders to DATABASE_DIR and WORK_DIR inside the container, creating shared paths between the host and the container.

1.2 Testing and Quick Start
Testing
ccrr -mode test

This tests the environment required by the pipeline using a small bundled sample dataset; it may take up to half an hour.

Quick Start
nohup ccrr -mode default \
        -normal [normal.bam] --normal-id [normal-id] \
        -tumor [tumor.bam] --tumor-id [tumor-id] \
        --genome-version hg38 -reference [hg38.fa] \
        -threads 30 -g 200 >log 2>&1 &

This runs the entire pipeline in default mode, allowing up to 30 threads for multi-threaded tasks and a memory cap of 200GB. Processing a matched tumor/normal pair of BAM files, each around 106GB, takes approximately 80 hours.

1.3 Usage

Display Help
ccrr --help
Required Parameters
Mode
  -mode {fast,custom,default,test,clear}        choose mode to run
Input and Information

Your input should be a matched pair of tumor/normal whole-genome sequencing BAM files and their reference genome. Supported reference genome versions are hg19 and hg38.

  -prefix                       task id  

  -normal NORMAL                normal bam
  --normal-id NORMAL_ID         Identifier for the normal sample, typically from the BAM header  

  -tumor TUMOR                  tumor bam
  --tumor-id TUMOR_ID           Identifier for the tumor sample, typically from the BAM header

  --genome-version {hg19,hg38}  Set the reference, hg19 or hg38
  -reference REFERENCE          reference FASTA
Optional Parameters
Configuring Multithreading and Available Memory

If not set, the default memory allocation is 8GB, which may not suffice for the memory demands of certain steps; we recommend setting it higher.
Please note that the default number of threads is 8. Some tools do not support multithreaded acceleration, and others have a soft cap on the number of threads they can use effectively, so setting a higher thread count may not yield the expected speed-up.

  -threads THREADS      Set the number of processes if possible
  -g G                  Set the amount of available RAM if possible
Selecting Required Tools

In custom mode, you can freely choose which software to use for generating SV and CN data. The built-in software includes:

  -sequenza             use sequenza for cellularity, ploidy and cn

  -delly                use delly for sv and cn 
  -manta                use manta for sv
  -svaba                use svaba for sv
  -gridss               use gridss for sv
  -lumpy                use lumpy for sv
  -soreca               use soreca for sv

  -sclust               use sclust for cn
  -purple               use purple for cn
  -cnvkit               use cnvkit for cn
Setting Quality Filtering

You can conveniently filter the results of each software based on quality before merging:

  --manta-filter MANTA_FILTER               Filter for manta
  --delly-filter DELLY_FILTER               Filter for delly sv
  --delly-cnvsize DELLY_CNVSIZE             min cnv size for delly
  --svaba-filter SVABA_FILTER               Filter for svaba
  --gridss-filter GRIDSS_FILTER             Filter for gridss
  --lumpy-filter LUMPY_FILTER               Filter for lumpy
Merging Method

When results from two different SV callers are adjacent in the genome and the distance between them is less than a specified threshold, they will be considered the same SV. The default threshold is 150bp.

  --sv-threshold SV_THRESHOLD

Select the method for merging results from different SV callers. If you wish to retain only the results supported by all the SV callers used, choose intersection; if you prefer to keep all results from any SV caller without duplicates, choose union; if you want to retain results supported by X or more tools, select x-or-more and specify the number in --sv-x. If X is not specified, the default is 3, meaning that results supported by three or more tools will be retained.

  --sv-merge-method {intersection,union,x-or-more}
                        Choose a sv merging method: 
                        1. 'intersection': Merges only the SVs that are identified by all SV callers. 
                        2. 'union': Merges all SVs identified by any of the SV callers. 
                        3. 'x-or-more': Merges SVs that are identified by at least x SV callers. If only one SV caller is used, this parameter is irrelevant.
  --sv-x {1,2,3,4,5,6}  
                        Specify the x. This argument is required when '--sv-merge-method' is set to 'x-or-more'. Must not exceed the number of SV callers provided. default=3
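As a sketch of how these merging policies behave (a simplified illustration, not CCRR's actual implementation), consider each consensus SV labeled with the set of callers that support it:

```python
def merge_decision(supporting_callers, n_callers, method, x=3):
    """Decide whether a consensus SV is kept, given the set of callers
    that reported it (simplified illustration of the three policies)."""
    k = len(supporting_callers)
    if method == "intersection":   # supported by every caller used
        return k == n_callers
    if method == "union":          # supported by at least one caller
        return k >= 1
    if method == "x-or-more":      # supported by at least x callers
        return k >= x
    raise ValueError(f"unknown merge method: {method}")

# With 6 callers and the default x=3:
print(merge_decision({"manta", "delly", "gridss"}, 6, "x-or-more"))  # True
print(merge_decision({"lumpy"}, 6, "intersection"))                  # False
print(merge_decision({"lumpy"}, 6, "union"))                         # True
```

Pairing SVs between callers additionally requires their breakpoints to lie within the --sv-threshold distance (150bp by default), which this sketch does not model.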

If you wish to prioritize a specific SV caller, setting --sv-primary-caller will retain all of its results.

  --sv-primary-caller {manta,delly,svaba,gridss,lumpy,soreca}
                        Specify the primary SV caller to keep all of its results.

Setting --cn-threshold adjusts the maximum allowable distance for determining overlap among copy-number change regions from different tools when merging copy number results. The default threshold is 5000bp.

  --cn-threshold CN_THRESHOLD
                        threshold for determining cn, defaults to 5000bp
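The overlap test can be pictured as follows (an illustrative sketch with hypothetical coordinates, not CCRR's actual code):

```python
def segments_overlap(seg_a, seg_b, threshold=5000):
    """Treat two copy-number segments (chrom, start, end) as the same
    region if they share a chromosome and the gap between them is at
    most `threshold` bp (simplified illustration)."""
    chrom_a, start_a, end_a = seg_a
    chrom_b, start_b, end_b = seg_b
    if chrom_a != chrom_b:
        return False
    # Negative gap means the segments physically overlap.
    gap = max(start_a, start_b) - min(end_a, end_b)
    return gap <= threshold

print(segments_overlap(("chr1", 100000, 200000), ("chr1", 203000, 300000)))  # True (gap 3000)
print(segments_overlap(("chr1", 100000, 200000), ("chr1", 210000, 300000)))  # False (gap 10000)
```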
Complex Rearrangement Analysis
Complex rearrangement analysis is conducted by default. If you only wish to obtain merged results, you can use -complex False.
You can also specify the tool used for cellularity and ploidy estimation with --cellularity-ploidy-tool, choosing between sequenza (default) and purple. This setting influences tools like JaBba and gGnome.

  -complex COMPLEX      complex rearrangement analysis
  --cellularity-ploidy-tool {sequenza,purple}

Output, Rerunning, and History

${WORK_DIR}/[task id] will serve as the working directory, retaining the output results of each part. A summary of the complex rearrangement analysis can be found in ${WORK_DIR}/[task id]/complex/summary.
Once a module is completed, it will be recorded in the ${WORK_DIR}/[task id]/history file. If the process is unexpectedly interrupted, rerunning the entire process will skip the parts that have been successfully executed according to the records in the history file, resuming from the point of interruption.
Of course, you can manually modify this file to skip any steps you wish to bypass.
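The resume behavior can be sketched like this (a hypothetical illustration; the actual history file format may differ):

```python
def modules_to_run(all_modules, history_lines):
    """Return the pipeline modules that still need to run, skipping any
    recorded as completed in the history file (hypothetical sketch of
    CCRR's resume logic)."""
    done = {line.strip() for line in history_lines if line.strip()}
    return [m for m in all_modules if m not in done]

pipeline = ["sv_calling", "cn_calling", "svmerge", "cnmerge", "complex"]
# Suppose the run was interrupted after cnmerge finished:
history = ["sv_calling\n", "cn_calling\n", "svmerge\n", "cnmerge\n"]
print(modules_to_run(pipeline, history))  # ['complex']
```

Deleting a line from the history file would likewise force that module to rerun.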

1.4 Custom Execution

You can run each module step by step according to your analytical needs. For example:

Using svmerge.py to Merge SV Data

You can specify the output results from each SV caller as input files for merging.

  -manta MANTA          manta vcf result
  -delly DELLY          delly vcf result
  -svaba SVABA          svaba vcf result
  -gridss GRIDSS        GRIDSS vcf result
  -lumpy LUMPY          LUMPY vcf result
  -soreca SORECA        soreca result

Filter the output results of each SV caller based on quality.

  --manta-filter MANTA_FILTER
                        Filter threshold for manta
  --delly-filter DELLY_FILTER
                        Filter threshold for delly
  --svaba-filter SVABA_FILTER
                        Filter threshold for svaba
  --gridss-filter GRIDSS_FILTER
                        Filter threshold for GRIDSS
  --lumpy-filter LUMPY_FILTER
                        Filter threshold for LUMPY

Determine thresholds, merging methods, and specify a trusted SV caller as described previously.

  --threshold THRESHOLD
                        threshold for determination, defaults to 150bp

  --merge-method {intersection,union,x-or-more}
                        Choose a merging method: 1. 'intersection': Merges only the SVs that are identified by all SV callers. 
                        2. 'union': Merges all SVs identified by any of the SV callers. 
                        3. 'x-or-more': Merges SVs that are identified by at least x SV callers. If only one SV caller is used, this parameter is irrelevant.

  --primary-caller {None,manta,delly,svaba,gridss,lumpy,soreca}
                        Specify the primary SV caller to keep all of its result.

  -x {1,2,3,4,5,6}      Specify the x. This argument is required when '--merge-method' is set to 'x-or-more'.
                        Must not exceed the number of input files provided.

Set the output path and enable multi-process execution.

  -o O                  output path
  -t T                  Set the number of processes
Use consensus_cn.py to merge CN data.
python ${SCRIPT_DIR}/consensus_cn.py  \
    -sclust SCLUST -delly DELLY -purple PURPLE -cnvkit CNVKIT \
    -ref hg19 -gender male \
    -o OUT 

Parameters

  -sclust SCLUST        sclust cn result
  -delly DELLY          delly cn result
  -purple PURPLE        purple cn result
  -cnvkit CNVKIT        cnvkit cn result
  -sequenza SEQUENZA    sequenza cn result

  --threshold THRESHOLD
                        threshold for determination, defaults to 5000bp

  -o O                  output path
  -ref REF              hg19 or hg38
Use complex.py to analyze complex rearrangements.
python ${SCRIPT_DIR}/complex.py  -prefix task_id \
        --tumor-id example -sv SV -cn CN \
        --genome-version hg19 -gender male \
        -shatterseek -starfish -gGnome -SA  -ctlpscanner \
        -threads 30 -g 200

Required inputs; the SV and CN files should follow the format of the example files:

https://www.ccrr.life/static/examplefile/custom_sv.bed
https://www.ccrr.life/static/examplefile/custom_cn.bed
  -prefix               task id
  --tumor-id TUMOR_ID
  -sv SV                sv input
  -cn CN                cn input
  --genome-version GENOME_VERSION
                        Set the reference, hg19 or hg38

Select the tools for complex rearrangement analysis.

  -shatterseek          use shatterseek
  -starfish             use starfish
  -gGnome               use jabba and gGnome
  -SA                   use Seismic Amplification
  -ctlpscanner          use CTLPscanner

AmpliconArchitect requires BAM files as input.

  -AA                           use Amplicon Architect

  -normal NORMAL                normal bam
  --normal-id NORMAL_ID

  -tumor TUMOR                  tumor bam
  --tumor-id TUMOR_ID

Set the available memory and number of threads.

  -threads THREADS      Set the number of processes if possible
  -g G                  Set the amount of available RAM if possible
1.5 Output
Summary: {WORK_DIR}/{PREFIX}/complex/summary.png

A visual summary of CN, SV integration, and analysis results of various complex rearrangements generated by the CCRR workflow.
The tracks, from outer to inner, display:

  1. Chromosomes:
    Shows the start and end points of chromosomal regions and the centromeres.

  2. CN:
    Regional colors indicate copy number gains (red) or losses (green);
    a black solid line represents a smoothed curve showing actual copy numbers,
    with a straight black line representing the default normal copy number state (CN=2).

  3. Shatterseek:
    Highlights chromosomal shatter regions with high confidence (orange) and low confidence (yellow)
    (criteria do not include statistical validation).

  4. CTLPScanner:
    Marks Chromothripsis-like Pattern areas, with region colors representing the log likelihood ratio (lg(LR) ≥ 5).

  5. Seismic Amplification:
    Indicates seismic amplification event areas (green).

  6. Starfish:
    Highlights complex genomic rearrangement areas (cyan).

  7. gGnome:
    Shows various complex event areas (details available in gGnome results).

  8. AmpliconArchitect:
    Marks ecDNA (blue), linear amplification (green), and BFB (yellow) areas
    (not available on the web).

  9. SV:
    Indicates different types of structural variation:

    • deletions (DEL, blue)
    • inversions (INV, green)
    • duplications (DUP, red)
    • translocations (TRA, brown)
CN merge:

Merge result: {WORK_DIR}/{PREFIX}/cnmerge/consensus_cn.bed
Segment count plot: {WORK_DIR}/{PREFIX}/cnmerge/segment_count.pdf

A bar plot showing the count of copy number segments across different length intervals.
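The counting behind such a plot can be sketched as follows (the bin edges here are hypothetical, not necessarily those used by CCRR):

```python
from collections import Counter

def bin_segment_lengths(segments, edges=(10_000, 100_000, 1_000_000)):
    """Count copy-number segments per length interval.
    `segments` is a list of (start, end) pairs; `edges` are
    illustrative bin boundaries in bp."""
    labels = ["<10kb", "10kb-100kb", "100kb-1Mb", ">=1Mb"]
    counts = Counter()
    for start, end in segments:
        length = end - start
        # Index = number of edges the length meets or exceeds.
        idx = sum(length >= e for e in edges)
        counts[labels[idx]] += 1
    return counts

segs = [(0, 5_000), (0, 50_000), (0, 500_000), (0, 2_000_000)]
print(bin_segment_lengths(segs))  # one segment in each bin
```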

Bias and volatility plot: {WORK_DIR}/{PREFIX}/cnmerge/Bias_and_volatility_for_CN_all_ranges.pdf

This figure shows the distribution of bias and volatility for each tool across different region lengths.
Bias reflects systematic deviation from the consensus copy number, while volatility captures the magnitude of variation.
Both are length-weighted and log-scaled to allow fair comparison across tools.
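One plausible reading of length-weighted bias and volatility, sketched with made-up numbers (the exact formulas used by CCRR are not documented here, so treat these definitions as assumptions):

```python
import math

def length_weighted_bias(segments):
    """Length-weighted mean deviation of a tool's copy number from the
    consensus. Each segment is (length_bp, tool_cn, consensus_cn).
    Hypothetical formula for illustration only."""
    total = sum(length for length, _, _ in segments)
    return sum(length * (cn - ref) for length, cn, ref in segments) / total

def length_weighted_volatility(segments):
    """Length-weighted standard deviation of those deviations
    (again a hypothetical definition)."""
    total = sum(length for length, _, _ in segments)
    mean = length_weighted_bias(segments)
    var = sum(length * ((cn - ref) - mean) ** 2
              for length, cn, ref in segments) / total
    return math.sqrt(var)

segs = [(100_000, 2.2, 2.0), (50_000, 1.8, 2.0), (50_000, 3.0, 2.0)]
print(round(length_weighted_bias(segs), 3))  # 0.3
```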

SV Merge

Merge result: {WORK_DIR}/{PREFIX}/svmerge/sv_merge.bed
SV caller consensus (Upset plot): {WORK_DIR}/{PREFIX}/svmerge/sv_merged.pdf

An Upset plot illustrating the overlap and consensus of structural variant calls among different SV tools.

2. CCRR Web Services

2.1 Start

The web services support multiple input options. You may begin with any of the following:

2.2 SV Input

By clicking on "From tools," you can upload results from various structural variant analysis tools.
You have the option to directly input the corresponding files, with examples available for review by clicking on the respective 'example' links. These include:

Delly: delly_example.sv.somatic.pre.vcf
Manta: manta_example.somaticSV.vcf
Gridss: gridss_example.gripss.filtered.vcf (processed with GRIPSS)
Lumpy: lumpy_example.gt.vcf
SvABA: svaba_example.somatic.sv.vcf
Soreca: soreca_example_unsnarl.txt
These sample files, derived from the public dataset SRR2020636, serve only as format references and hold no analytical significance.

Upload Options:
You can upload results from one to six different structural variant analysis tools. If only one file is uploaded, we will convert its format and proceed with the complex structural variant analysis. If two or more files are uploaded, they will first be merged.
Custom Data:
If you wish to use your own structural variant data, you can click on "From Custom" to upload your customized data.

The format for custom structural variant data should be as follows:

Formatting Requirements:

The example data available via the Example link is sourced from the PCAWG consensus public structural variant data (source link), specifically from the dataset 0c0038ff-6cc4-b0b0-e050-11ac0d483d73, which can be used for demonstration analyses. You can click the "Load Example" button to load the sample file.

2.3 CN Input

By clicking on "From tools", you can upload copy number analysis results from various tools.

You have the option to directly input the corresponding files, with examples available for review by clicking on the respective "example" links. These include:

These sample files, derived from the public dataset SRR2020636, serve only as format references and hold no analytical significance.

Upload Options:
You can upload results from one to four different copy number variant analysis tools.

Custom Data:
If you wish to use your own copy number data, you can click on "From Custom" to upload your customized data.

The format for custom copy number data should be as follows:

Formatting Requirements:

The example data available via the Example link is sourced from the PCAWG consensus public copy number variant data (source link), specifically from the dataset 0c0038ff-6cc4-b0b0-e050-11ac0d483d73, which can be used for demonstration analyses.

2.4 Parameters

Click on "options" to expand the options card and customize parameters for the analysis. The parameters include:


2.5 Starting the Analysis

Ensure that you have:

Then, click "start". You will see a waiting page indicating that your analysis is either in queue or in progress. Once the analysis is complete, you will be redirected to the results page.

2.6 Interactive Result Visualization

This is an interactive web interface designed for exploring complex genomic rearrangement results through an intuitive Circos-based view.
The result page will automatically load the analysis results based on the files you uploaded.
If you wish to explore results interactively using your own data or view outputs from a local CCRR pipeline run, you can visit https://www.ccrr.life/customize-data, where you are allowed to upload your own .json result file.


Left Panel: Control Panel

The control panel on the left side of the interface provides key functionalities:

Center Panel: Circos Plot

The central Circos plot offers a dynamic visualization of genome-wide CN and SV integration, with multiple inner and outer tracks showing different types of variation and complex events.

Interactive features:

Nearly all elements in the Circos plot are interactive, including:

Right Panel: Detailed Information Popup

When an element in the plot is clicked, a detailed popup appears on the right, showing:

This allows users to inspect specific regions or events in depth and trace their origin or biological relevance.

Customization Options

Users can personalize the visualization via the control panel:

This flexible interface supports efficient exploration of complex SV and CNV landscapes in tumor genomes or other rearrangement-rich datasets.

3. Step-by-Step Example

3.1 Installation and Data Preparation

Create a Working Directory

mkdir ccrr1.2
cd ccrr1.2

Download CCRR

wget -O ccrr1.2.zip https://www.ccrr.life/download_file/ccrr1.2.zip
unzip -q ccrr1.2.zip

Download Dependencies and Create a Dockerfile

python install.py -sequenza -manta -delly -svaba -gridss -lumpy -soreca -purple -sclust -cnvkit -ref 'hg19&hg38'

Prepare licenses for Mosek and Gurobi

cp /path/to/gurobi.lic ./gurobi.lic
cp /path/to/mosek.lic ./mosek.lic

Prepare Input Data

We use data from the breast cancer cell line HCC1395/HCC1395BL, part of a multi-center study (DOI: 10.1186/s13059-022-02816-6). The BAM files were downloaded from: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/data/WGS/

Files are placed at:

./share/hcc1395/WGS_FD_T_1.bam       # Tumor
./share/hcc1395/WGS_FD_N_1.bam       # Normal
./share/hcc1395/WGS_FD_T_1.bam.bai   # Index
./share/hcc1395/WGS_FD_N_1.bam.bai   # Index

The corresponding reference genome used for alignment is in

./share/data/ref/GRCh38.d1

Build Docker image

docker build --pull --rm --build-arg GITHUB_PAT=[GITHUB_PAT] \
        --build-arg SCRIPT_DIR="/home/0.script" --build-arg TOOL_DIR="/home/1.tools" \
        --build-arg DATABASE_DIR="/home/2.share" --build-arg WORK_DIR="/home/3.wd" \
        -f Dockerfile -t ccrr:v1.2 .

Run a container

docker run -v $(pwd)/share:/home/2.share -v $(pwd)/wd:/home/3.wd -d -it --name ccrr ccrr:v1.2
docker exec -it ccrr /bin/bash
3.2 Run the Pipeline Locally

Here, we selected all tools except SoReCa for this analysis.

nohup ccrr -mode custom -prefix hcc1395 \
        -normal /home/2.share/hcc1395/WGS_FD_N_1.bam --normal-id WGS_FD_N_1 \
        -tumor /home/2.share/hcc1395/WGS_FD_T_1.bam --tumor-id WGS_FD_T_1 \
        --genome-version hg38 -reference /home/2.share/data/ref/GRCh38.d1/GRCh38.d1.vd1.fa \
        -cnvkit -delly -manta -lumpy -gridss -svaba -purple -sclust \
        --cellularity-ploidy-tool sequenza \
        -threads 30 -g 200 >log 2>&1 &

The analysis completes in about 100 hours.

In /home/3.wd/hcc1395/svmerge, you can find the SV results from individual tools, the merged SV calls sv_merged.bed, and an UpSet plot sv_merged.pdf illustrating the overlaps among different SV datasets.


In /home/3.wd/hcc1395/cnmerge, you will find the CN analysis results from individual tools, as well as the merged consensus result: consensus_cn.bed


segment_count.pdf: counts of CNV segments by length


Bias_and_volatility_for_CN_all_ranges.pdf: shows bias (deviation from consensus) and volatility (variation across tools) across segment sizes.

In /home/3.wd/hcc1395/complex, you will find the results from six complex rearrangement analysis tools, a summary figure summary.png,


and a JSON file hcc1395circos.json for web-based visualization.

3.3 Use the web service to explore the results in detail

To explore the results interactively, go to https://www.ccrr.life/ and click "Customize Data" in the top-right menu to access the custom upload page.


On the Customize Data page, click "Choose a file" to select hcc1395circos.json, then click "Upload & Render". The Circos plot will be rendered after a short loading period.


Clicking on any element in the plot reveals detailed annotations and associated information.


To focus on regions where multiple tools show consensus, enter the coordinates chr3:57825870-130091239;chr6:10307610-122158017 into the "Genome Region" field in the control panel, then click "Add". This will zoom in and display a more detailed and clearer view of the selected regions.
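The region syntax accepted by that field can be parsed as shown below (a small helper written purely for illustration; it is not part of CCRR):

```python
def parse_regions(text):
    """Parse semicolon-separated 'chrom:start-end' region strings,
    e.g. 'chr3:57825870-130091239;chr6:10307610-122158017',
    into (chrom, start, end) tuples."""
    regions = []
    for part in text.split(";"):
        chrom, span = part.split(":")
        start, end = (int(x) for x in span.split("-"))
        regions.append((chrom, start, end))
    return regions

print(parse_regions("chr3:57825870-130091239;chr6:10307610-122158017"))
```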


To save the current visualization, click "Export SVG" to export it as a scalable vector graphic. If needed, click "Reset" to clear custom regions and revert the view to its default state.

3.4 Upload Files for Analysis

We uploaded the locally generated results from Delly, Manta, Gridss, Lumpy, Purple, and CNVkit.
After uploading the files, select the reference genome as hg38, then click "Start" to begin the analysis.


The system will redirect to a waiting page. After approximately 20 minutes, it will automatically jump to the results page.

input