The HUGO Gene Nomenclature Committee (HGNC) is a committee of the Human Genome Organisation (HUGO) that sets the standards for human gene nomenclature.
The HGNC approves a unique and meaningful name for every known human gene, relying on a group of experts. The HGNC also provides a mapping from gene symbols to gene entries in other popular databases and resources: the HGNC complete gene set.
Use the function import_hgnc_dataset() to import the latest HGNC complete gene data set directly into R. This downloads the latest archive from https://www.genenames.org/download/archive/monthly/tsv/.
(hgnc_dataset <- import_hgnc_dataset())
#> # A tibble: 44,117 × 55
#> hgnc_id hgnc_id2 symbol name locus_group locus_type status location
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 HGNC:5 5 A1BG alpha-1… protein-co… gene with… Appro… 19q13.43
#> 2 HGNC:37133 37133 A1BG-AS1 A1BG an… non-coding… RNA, long… Appro… 19q13.43
#> 3 HGNC:24086 24086 A1CF APOBEC1… protein-co… gene with… Appro… 10q11.23
#> 4 HGNC:7 7 A2M alpha-2… protein-co… gene with… Appro… 12p13.31
#> 5 HGNC:27057 27057 A2M-AS1 A2M ant… non-coding… RNA, long… Appro… 12p13.31
#> 6 HGNC:23336 23336 A2ML1 alpha-2… protein-co… gene with… Appro… 12p13.31
#> 7 HGNC:41022 41022 A2ML1-AS1 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#> 8 HGNC:41523 41523 A2ML1-AS2 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#> 9 HGNC:8 8 A2MP1 alpha-2… pseudogene pseudogene Appro… 12p13.31
#> 10 HGNC:30005 30005 A3GALT2 alpha 1… protein-co… gene with… Appro… 1p35.1
#> # ℹ 44,107 more rows
#> # ℹ 47 more variables: location_sortable <chr>, alias_symbol <list>,
#> # alias_name <list>, prev_symbol <list>, prev_name <list>, gene_group <list>,
#> # gene_group_id <list>, date_approved_reserved <date>,
#> # date_symbol_changed <date>, date_name_changed <date>, date_modified <date>,
#> # entrez_id <chr>, ensembl_gene_id <chr>, vega_id <chr>, ucsc_id <chr>,
#> # ena <list>, refseq_accession <list>, ccds_id <list>, uniprot_ids <list>, …
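Once imported, the data set can be queried like any other data frame, for example to look up the HGNC identifier and cytogenetic location of a few symbols. The sketch below is only an illustration: `hgnc_toy` is a small stand-in data frame built from four rows and four columns of the output above (so it runs without a download), and `lookup_symbols()` is a hypothetical helper, not a function of the hgnc package.

```r
# Toy stand-in for the tibble returned by import_hgnc_dataset(); the values
# are copied from the first rows of the printed output above.
hgnc_toy <- data.frame(
  hgnc_id  = c("HGNC:5", "HGNC:37133", "HGNC:24086", "HGNC:7"),
  hgnc_id2 = c(5L, 37133L, 24086L, 7L),
  symbol   = c("A1BG", "A1BG-AS1", "A1CF", "A2M"),
  location = c("19q13.43", "19q13.43", "10q11.23", "12p13.31"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows whose symbol is one of `symbols`.
lookup_symbols <- function(dataset, symbols) {
  dataset[dataset$symbol %in% symbols, , drop = FALSE]
}

hits <- lookup_symbols(hgnc_toy, c("A1CF", "A2M"))
print(hits[, c("symbol", "hgnc_id", "location")])
```

The same subsetting should work on the real `hgnc_dataset`, since it exposes the `symbol`, `hgnc_id` and `location` columns shown above.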
Saving to disk
import_hgnc_dataset() reads data directly into memory. Use download_hgnc_dataset() to save the data to disk first, and then import it from disk:
# Note that `download_hgnc_dataset()` returns the path of the saved file.
(local_file <- download_hgnc_dataset())
#> [1] "/home/runner/work/hgnc/hgnc/vignettes/articles/hgnc_complete_set.tsv"
import_hgnc_dataset(file = local_file)
#> # A tibble: 44,117 × 55
#> hgnc_id hgnc_id2 symbol name locus_group locus_type status location
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 HGNC:5 5 A1BG alpha-1… protein-co… gene with… Appro… 19q13.43
#> 2 HGNC:37133 37133 A1BG-AS1 A1BG an… non-coding… RNA, long… Appro… 19q13.43
#> 3 HGNC:24086 24086 A1CF APOBEC1… protein-co… gene with… Appro… 10q11.23
#> 4 HGNC:7 7 A2M alpha-2… protein-co… gene with… Appro… 12p13.31
#> 5 HGNC:27057 27057 A2M-AS1 A2M ant… non-coding… RNA, long… Appro… 12p13.31
#> 6 HGNC:23336 23336 A2ML1 alpha-2… protein-co… gene with… Appro… 12p13.31
#> 7 HGNC:41022 41022 A2ML1-AS1 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#> 8 HGNC:41523 41523 A2ML1-AS2 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#> 9 HGNC:8 8 A2MP1 alpha-2… pseudogene pseudogene Appro… 12p13.31
#> 10 HGNC:30005 30005 A3GALT2 alpha 1… protein-co… gene with… Appro… 1p35.1
#> # ℹ 44,107 more rows
#> # ℹ 47 more variables: location_sortable <chr>, alias_symbol <list>,
#> # alias_name <list>, prev_symbol <list>, prev_name <list>, gene_group <list>,
#> # gene_group_id <list>, date_approved_reserved <date>,
#> # date_symbol_changed <date>, date_name_changed <date>, date_modified <date>,
#> # entrez_id <chr>, ensembl_gene_id <chr>, vega_id <chr>, ucsc_id <chr>,
#> # ena <list>, refseq_accession <list>, ccds_id <list>, uniprot_ids <list>, …
Data releases
HGNC complete gene data sets are released monthly and quarterly. If you prefer to download a specific release other than the latest, first use the function list_archives() to list the available archives:
(releases <- list_archives("monthly"))
#> # A tibble: 135 × 9
#> series dataset archive file date size last_modified md5sum
#> <chr> <chr> <chr> <chr> <date> <chr> <dttm> <chr>
#> 1 monthly hgnc_compl… hgnc_c… hgnc… 2021-03-01 15.2… 2024-10-01 14:52:08 5bdr4…
#> 2 monthly hgnc_compl… hgnc_c… hgnc… 2021-04-01 15.2… 2024-10-01 14:52:02 kQTSM…
#> 3 monthly hgnc_compl… hgnc_c… hgnc… 2021-05-01 15.2… 2024-10-01 14:52:15 dRPRI…
#> 4 monthly hgnc_compl… hgnc_c… hgnc… 2021-06-01 15.8… 2024-10-01 14:51:59 GzZd4…
#> 5 monthly hgnc_compl… hgnc_c… hgnc… 2021-07-01 15.8… 2024-10-01 14:52:10 POYCr…
#> 6 monthly hgnc_compl… hgnc_c… hgnc… 2021-08-01 15.9… 2024-10-01 14:52:01 rsxim…
#> 7 monthly hgnc_compl… hgnc_c… hgnc… 2021-09-01 15.9… 2024-10-01 14:52:14 UH0Q+…
#> 8 monthly hgnc_compl… hgnc_c… hgnc… 2021-10-01 15.9… 2024-10-01 14:51:55 0m9EC…
#> 9 monthly hgnc_compl… hgnc_c… hgnc… 2021-11-01 15.9… 2024-10-01 14:52:08 xihTt…
#> 10 monthly hgnc_compl… hgnc_c… hgnc… 2021-12-01 16.0… 2024-10-01 14:51:53 jbFMg…
#> # ℹ 125 more rows
#> # ℹ 1 more variable: url <chr>
Then pass the URL corresponding to the archive of interest to import_hgnc_dataset():
# First listed monthly archive, dated 2021-03-01
(url <- releases$url[1])
#> [1] "https://storage.googleapis.com/download/storage/v1/b/public-download-files/o/hgnc%2Farchive%2Farchive%2Fmonthly%2Ftsv%2Fhgnc_complete_set_2021-03-01.txt?generation=1727794328124222&alt=media"
import_hgnc_dataset(file = url)
#> # A tibble: 42,423 × 53
#> hgnc_id hgnc_id2 symbol name locus_group locus_type status location
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 HGNC:5 5 A1BG alpha-1… protein-co… gene with… Appro… 19q13.43
#> 2 HGNC:37133 37133 A1BG-AS1 A1BG an… non-coding… RNA, long… Appro… 19q13.43
#> 3 HGNC:24086 24086 A1CF APOBEC1… protein-co… gene with… Appro… 10q11.23
#> 4 HGNC:7 7 A2M alpha-2… protein-co… gene with… Appro… 12p13.31
#> 5 HGNC:27057 27057 A2M-AS1 A2M ant… non-coding… RNA, long… Appro… 12p13.31
#> 6 HGNC:23336 23336 A2ML1 alpha-2… protein-co… gene with… Appro… 12p13.31
#> 7 HGNC:41022 41022 A2ML1-AS1 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#> 8 HGNC:41523 41523 A2ML1-AS2 A2ML1 a… non-coding… RNA, long… Appro… 12p13.31
#> 9 HGNC:8 8 A2MP1 alpha-2… pseudogene pseudogene Appro… 12p13.31
#> 10 HGNC:30005 30005 A3GALT2 alpha 1… protein-co… gene with… Appro… 1p35.1
#> # ℹ 42,413 more rows
#> # ℹ 45 more variables: location_sortable <chr>, alias_symbol <list>,
#> # alias_name <list>, prev_symbol <list>, prev_name <list>, gene_family <chr>,
#> # gene_family_id <chr>, date_approved_reserved <date>,
#> # date_symbol_changed <date>, date_name_changed <date>, date_modified <date>,
#> # entrez_id <chr>, ensembl_gene_id <chr>, vega_id <chr>, ucsc_id <chr>,
#> # ena <list>, refseq_accession <list>, ccds_id <list>, uniprot_ids <list>, …
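Instead of picking an archive by its position, it may be more convenient to select it by release date, using the `date` and `url` columns of the table returned by list_archives(). The following is a minimal sketch under stated assumptions: `releases_toy` is a stand-in data frame with placeholder `example.org` URLs (not real HGNC links), and `release_url()` is a hypothetical helper, not part of the hgnc package.

```r
# Toy stand-in for the table returned by list_archives("monthly"); the dates
# come from the output above, the URLs are placeholders for illustration only.
releases_toy <- data.frame(
  date = as.Date(c("2021-03-01", "2021-04-01", "2021-05-01")),
  url  = c(
    "https://example.org/hgnc_complete_set_2021-03-01.txt",
    "https://example.org/hgnc_complete_set_2021-04-01.txt",
    "https://example.org/hgnc_complete_set_2021-05-01.txt"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the URL of the release with the given date,
# failing loudly when there is no match or more than one.
release_url <- function(releases, date) {
  wanted <- as.Date(date)
  hit <- releases$url[releases$date == wanted]
  if (length(hit) != 1L) {
    stop("Expected exactly one release dated ", format(wanted),
         ", found ", length(hit), call. = FALSE)
  }
  hit
}

url <- release_url(releases_toy, "2021-04-01")
print(url)
```

With the real `releases` table, the returned URL could then be passed on as `import_hgnc_dataset(file = url)`, exactly as in the example above.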