Importing • hgnc

The HUGO Gene Nomenclature Committee (HGNC) is a committee of the Human Genome Organisation (HUGO) that sets the standards for human gene nomenclature.

The HGNC approves a unique and meaningful name for every known human gene, in consultation with experts. It also maintains the HGNC complete gene set, which maps gene symbols to the corresponding entries in other popular databases and resources.

Use the function import_hgnc_dataset() to import the latest HGNC complete gene data set directly into R. This downloads the latest archive from https://www.genenames.org/download/archive/monthly/tsv/.

(hgnc_dataset <- import_hgnc_dataset())
#> # A tibble: 44,117 × 55
#>    hgnc_id    hgnc_id2 symbol    name     locus_group locus_type status location
#>    <chr>         <int> <chr>     <chr>    <chr>       <chr>      <chr>  <chr>   
#>  1 HGNC:5            5 A1BG      alpha-1… protein-co… gene with… Appro… 19q13.43
#>  2 HGNC:37133    37133 A1BG-AS1  A1BG an… non-coding… RNA, long… Appro… 19q13.43
#>  3 HGNC:24086    24086 A1CF      APOBEC1… protein-co… gene with… Appro… 10q11.23
#>  4 HGNC:7            7 A2M       alpha-2… protein-co… gene with… Appro… 12p13.31
#>  5 HGNC:27057    27057 A2M-AS1   A2M ant… non-coding… RNA, long… Appro… 12p13.31
#>  6 HGNC:23336    23336 A2ML1     alpha-2… protein-co… gene with… Appro… 12p13.31
#>  7 HGNC:41022    41022 A2ML1-AS1 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#>  8 HGNC:41523    41523 A2ML1-AS2 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#>  9 HGNC:8            8 A2MP1     alpha-2… pseudogene  pseudogene Appro… 12p13.31
#> 10 HGNC:30005    30005 A3GALT2   alpha 1… protein-co… gene with… Appro… 1p35.1  
#> # ℹ 44,107 more rows
#> # ℹ 47 more variables: location_sortable <chr>, alias_symbol <list>,
#> #   alias_name <list>, prev_symbol <list>, prev_name <list>, gene_group <list>,
#> #   gene_group_id <list>, date_approved_reserved <date>,
#> #   date_symbol_changed <date>, date_name_changed <date>, date_modified <date>,
#> #   entrez_id <chr>, ensembl_gene_id <chr>, vega_id <chr>, ucsc_id <chr>,
#> #   ena <list>, refseq_accession <list>, ccds_id <list>, uniprot_ids <list>, …

Saving to disk

import_hgnc_dataset() reads the data directly into memory. To keep a local copy, use download_hgnc_dataset() to save the data set to disk first, and then import it from the saved file:

# Note that `download_hgnc_dataset()` returns the path of the saved file.
(local_file <- download_hgnc_dataset())
#> [1] "/home/runner/work/hgnc/hgnc/vignettes/articles/hgnc_complete_set.tsv"
import_hgnc_dataset(file = local_file)
#> # A tibble: 44,117 × 55
#>    hgnc_id    hgnc_id2 symbol    name     locus_group locus_type status location
#>    <chr>         <int> <chr>     <chr>    <chr>       <chr>      <chr>  <chr>   
#>  1 HGNC:5            5 A1BG      alpha-1… protein-co… gene with… Appro… 19q13.43
#>  2 HGNC:37133    37133 A1BG-AS1  A1BG an… non-coding… RNA, long… Appro… 19q13.43
#>  3 HGNC:24086    24086 A1CF      APOBEC1… protein-co… gene with… Appro… 10q11.23
#>  4 HGNC:7            7 A2M       alpha-2… protein-co… gene with… Appro… 12p13.31
#>  5 HGNC:27057    27057 A2M-AS1   A2M ant… non-coding… RNA, long… Appro… 12p13.31
#>  6 HGNC:23336    23336 A2ML1     alpha-2… protein-co… gene with… Appro… 12p13.31
#>  7 HGNC:41022    41022 A2ML1-AS1 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#>  8 HGNC:41523    41523 A2ML1-AS2 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#>  9 HGNC:8            8 A2MP1     alpha-2… pseudogene  pseudogene Appro… 12p13.31
#> 10 HGNC:30005    30005 A3GALT2   alpha 1… protein-co… gene with… Appro… 1p35.1  
#> # ℹ 44,107 more rows
#> # ℹ 47 more variables: location_sortable <chr>, alias_symbol <list>,
#> #   alias_name <list>, prev_symbol <list>, prev_name <list>, gene_group <list>,
#> #   gene_group_id <list>, date_approved_reserved <date>,
#> #   date_symbol_changed <date>, date_name_changed <date>, date_modified <date>,
#> #   entrez_id <chr>, ensembl_gene_id <chr>, vega_id <chr>, ucsc_id <chr>,
#> #   ena <list>, refseq_accession <list>, ccds_id <list>, uniprot_ids <list>, …
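
If the import is run repeatedly, it may be convenient to reuse a file that has already been saved rather than download it again. The following is a minimal sketch of that idea, not part of the hgnc package: cached_path() is a hypothetical helper, and the commented usage assumes that download_hgnc_dataset() saves hgnc_complete_set.tsv into the working directory, as the path printed above suggests.

```r
# Hypothetical helper (not part of hgnc): reuse a saved file when it is
# present, otherwise call `download()` and return the path it reports.
cached_path <- function(path, download) {
  if (file.exists(path)) path else download()
}

# With hgnc installed and network access, usage might look like this
# (the file name is an assumption based on the output shown above):
# local_file <- cached_path("hgnc_complete_set.tsv", hgnc::download_hgnc_dataset)
# hgnc_dataset <- hgnc::import_hgnc_dataset(file = local_file)
```

Note that a cached file is never refreshed by this helper; delete it to pick up a newer release.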

Data releases

HGNC complete gene data sets are released monthly and quarterly. To import a specific release other than the latest, first use list_archives() to list the available archives:

(releases <- list_archives("monthly"))
#> # A tibble: 135 × 9
#>    series  dataset     archive file  date       size  last_modified       md5sum
#>    <chr>   <chr>       <chr>   <chr> <date>     <chr> <dttm>              <chr> 
#>  1 monthly hgnc_compl… hgnc_c… hgnc… 2021-03-01 15.2… 2024-10-01 14:52:08 5bdr4…
#>  2 monthly hgnc_compl… hgnc_c… hgnc… 2021-04-01 15.2… 2024-10-01 14:52:02 kQTSM…
#>  3 monthly hgnc_compl… hgnc_c… hgnc… 2021-05-01 15.2… 2024-10-01 14:52:15 dRPRI…
#>  4 monthly hgnc_compl… hgnc_c… hgnc… 2021-06-01 15.8… 2024-10-01 14:51:59 GzZd4…
#>  5 monthly hgnc_compl… hgnc_c… hgnc… 2021-07-01 15.8… 2024-10-01 14:52:10 POYCr…
#>  6 monthly hgnc_compl… hgnc_c… hgnc… 2021-08-01 15.9… 2024-10-01 14:52:01 rsxim…
#>  7 monthly hgnc_compl… hgnc_c… hgnc… 2021-09-01 15.9… 2024-10-01 14:52:14 UH0Q+…
#>  8 monthly hgnc_compl… hgnc_c… hgnc… 2021-10-01 15.9… 2024-10-01 14:51:55 0m9EC…
#>  9 monthly hgnc_compl… hgnc_c… hgnc… 2021-11-01 15.9… 2024-10-01 14:52:08 xihTt…
#> 10 monthly hgnc_compl… hgnc_c… hgnc… 2021-12-01 16.0… 2024-10-01 14:51:53 jbFMg…
#> # ℹ 125 more rows
#> # ℹ 1 more variable: url <chr>

Then pass the URL corresponding to the archive of interest to import_hgnc_dataset():

# Oldest release in the listing: 2021-03-01
(url <- releases$url[1])
#> [1] "https://storage.googleapis.com/download/storage/v1/b/public-download-files/o/hgnc%2Farchive%2Farchive%2Fmonthly%2Ftsv%2Fhgnc_complete_set_2021-03-01.txt?generation=1727794328124222&alt=media"
import_hgnc_dataset(file = url)
#> # A tibble: 42,423 × 53
#>    hgnc_id    hgnc_id2 symbol    name     locus_group locus_type status location
#>    <chr>         <int> <chr>     <chr>    <chr>       <chr>      <chr>  <chr>   
#>  1 HGNC:5            5 A1BG      alpha-1… protein-co… gene with… Appro… 19q13.43
#>  2 HGNC:37133    37133 A1BG-AS1  A1BG an… non-coding… RNA, long… Appro… 19q13.43
#>  3 HGNC:24086    24086 A1CF      APOBEC1… protein-co… gene with… Appro… 10q11.23
#>  4 HGNC:7            7 A2M       alpha-2… protein-co… gene with… Appro… 12p13.31
#>  5 HGNC:27057    27057 A2M-AS1   A2M ant… non-coding… RNA, long… Appro… 12p13.31
#>  6 HGNC:23336    23336 A2ML1     alpha-2… protein-co… gene with… Appro… 12p13.31
#>  7 HGNC:41022    41022 A2ML1-AS1 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#>  8 HGNC:41523    41523 A2ML1-AS2 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#>  9 HGNC:8            8 A2MP1     alpha-2… pseudogene  pseudogene Appro… 12p13.31
#> 10 HGNC:30005    30005 A3GALT2   alpha 1… protein-co… gene with… Appro… 1p35.1  
#> # ℹ 42,413 more rows
#> # ℹ 45 more variables: location_sortable <chr>, alias_symbol <list>,
#> #   alias_name <list>, prev_symbol <list>, prev_name <list>, gene_family <chr>,
#> #   gene_family_id <chr>, date_approved_reserved <date>,
#> #   date_symbol_changed <date>, date_name_changed <date>, date_modified <date>,
#> #   entrez_id <chr>, ensembl_gene_id <chr>, vega_id <chr>, ucsc_id <chr>,
#> #   ena <list>, refseq_accession <list>, ccds_id <list>, uniprot_ids <list>, …
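
Picking a release by row position is fragile, because the listing grows over time. As a sketch, a release can instead be selected by its date, using the date and url columns shown in the list_archives() output above. pick_release_url() is a hypothetical helper, not an hgnc function, and the toy table with example.org URLs only stands in for the real listing so that the snippet runs offline.

```r
# Hypothetical helper (not part of hgnc): return the URL of the release
# published on `date`, failing loudly unless exactly one release matches.
pick_release_url <- function(releases, date) {
  date <- as.Date(date)
  hit <- releases$url[which(releases$date == date)]
  if (length(hit) != 1L) {
    stop(sprintf("Expected one release dated %s, found %d.",
                 format(date), length(hit)))
  }
  hit
}

# Toy stand-in for the table returned by list_archives("monthly");
# the URLs below are placeholders, not real archive locations.
toy_releases <- data.frame(
  date = as.Date(c("2021-03-01", "2021-04-01")),
  url = c("https://example.org/hgnc_complete_set_2021-03-01.txt",
          "https://example.org/hgnc_complete_set_2021-04-01.txt"),
  stringsAsFactors = FALSE
)

pick_release_url(toy_releases, "2021-04-01")
#> [1] "https://example.org/hgnc_complete_set_2021-04-01.txt"
```

With the real listing, the result of pick_release_url(releases, "2021-06-01") could be passed to import_hgnc_dataset(file = ...) in the same way as the URL above.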