FastaIO.jl — FASTA file reader and writer module
This module provides ways to parse and write files in FASTA format in Julia. It is designed to be lightweight and fast; the parsing method is inspired by kseq.h. It can read and write files on the fly, keeping only one entry at a time in memory, and it can read and write gzip-compressed files.
Here is a quick example for reading a file:
julia> using FastaIO
julia> FastaReader("somefile.fasta") do fr
           for (desc, seq) in fr
               println("$desc : $seq")
           end
       end
And for writing:
julia> using FastaIO
julia> FastaWriter("somefile.fasta") do fw
           for s in [">GENE1", "GCATT", ">GENE2", "ATTAGC"]
               write(fw, s)
           end
       end
Installation and usage
To install the module, use Julia's package manager: start pkg mode by pressing ] and then enter:
(v1.3) pkg> add FastaIO
Dependencies will be installed automatically. The module can then be loaded like any other Julia module:
julia> using FastaIO
Introductory notes
For both reading and writing, there are quick methods to read/write all the data at once: readfasta and writefasta. These, however, require all the data to be held in memory at once, which may be impossible or undesirable for very large files. The preferred way is therefore to use the specialized types FastaReader and FastaWriter, which can process one entry (description + sequence data) at a time (the writer can actually process one character at a time). Note, however, that these two types are not symmetric: the reader acts as an iterable object, while the writer behaves similarly to an IO stream.
The FASTA format
The FASTA format which is assumed by this module is as follows:
- description lines must start with a > character and cannot be empty
- only one description line per entry is allowed
- all characters must be ASCII
- whitespace is not allowed within sequence data (except for newlines), nor at the beginning or end of the description
- empty lines are ignored (note, however, that lines containing whitespace will still trigger an error)
When writing, description lines longer than 80 characters will trigger a warning message (this can be optionally disabled); sequence data is formatted in lines of 80 characters each; extra whitespace is silently discarded. No other restriction is put on the content of the sequence data, except that the > character is forbidden.
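The line wrapping applied when writing can be pictured with a small sketch in plain Julia. The helper name wrap_sequence is hypothetical and is not part of FastaIO. It only mimics the documented behaviour (whitespace dropped, lines of at most 80 characters) and makes no claim about how the module implements it.

```julia
# Hypothetical helper (not part of FastaIO): split sequence data into
# fixed-width lines after discarding all whitespace.
# Assumes ASCII input, as the format requires.
function wrap_sequence(seq::AbstractString, width::Int = 80)
    cleaned = filter(!isspace, seq)
    n = ncodeunits(cleaned)
    return [cleaned[i:min(i + width - 1, n)] for i in 1:width:n]
end

# Print a short sequence wrapped at 4 characters per line.
for line in wrap_sequence("GCA TTT\nAGC", 4)
    println(line)
end
```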
When reading, almost no explicit checks are performed to test that the data actually conforms to these specifications.
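Since the reader performs almost no checks, untrusted input could be pre-screened before parsing. The following is a minimal sketch in plain Julia of such a check against the rules listed in this section. The function fasta_problems and its messages are hypothetical and not part of FastaIO, and the check is only an approximation of the specification (for instance, a Windows-style carriage return is reported as whitespace).

```julia
# Hypothetical helper (not part of FastaIO): collect violations of the
# format rules of this section, one message per offending line.
function fasta_problems(text::AbstractString)
    problems = String[]
    prev_was_description = false
    for (n, line) in enumerate(split(text, '\n'))
        isempty(line) && continue                  # empty lines are ignored
        if !isascii(line)
            push!(problems, "line $n: non-ASCII character")
        elseif startswith(line, ">")
            desc = line[2:end]
            if isempty(desc)
                push!(problems, "line $n: empty description")
            elseif isspace(first(desc)) || isspace(last(desc))
                push!(problems, "line $n: whitespace at start or end of description")
            elseif prev_was_description
                push!(problems, "line $n: second description line in one entry")
            end
            prev_was_description = true
            continue
        elseif any(isspace, line)
            push!(problems, "line $n: whitespace in sequence data")
        end
        prev_was_description = false
    end
    return problems
end

# Report the problems found in a deliberately malformed input.
for msg in fasta_problems(">GENE1\nGCA TT\n>\nATTAGC\n")
    println(msg)
end
```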
The sequence storage type
When reading FASTA files, the container type used to store the sequence data can be chosen (as an optional argument to readfasta or as a type parameter of FastaReader). The default is String, which is the most memory-efficient and the fastest; another performance-optimal option is Vector{UInt8}, which is a less friendly representation but has the advantage of being mutable. Any other container T for which convert(::Type{T}, ::Vector{UInt8}) is defined can be used (e.g. Vector{Char}, or a more specialized Vector{AminoAcid} if you use BioSeq), but the conversion will generally slightly reduce the performance.
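As an illustration of the mutability advantage, the sketch below reads sequences as Vector{UInt8} and edits them in place. The file path and the "mask the first residue with N" step are placeholders chosen for the example. The do-notation form with a sequence type argument is the one documented later in this page.

```julia
using FastaIO

# Placeholder file in a temporary directory, created just for this example.
path = joinpath(mktempdir(), "example.fasta")
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

# Read sequences as mutable byte vectors instead of Strings.
masked = Tuple{String,String}[]
FastaReader(path, Vector{UInt8}) do fr
    for (desc, seq) in fr
        seq[1] = UInt8('N')                        # in-place edit of the byte vector
        push!(masked, (desc, String(copy(seq))))   # convert a copy back to a String
    end
end
```

With the default String storage, the same edit would require building a new string for every entry.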
Reading files
FastaIO.readfasta — Function
readfasta(file::Union{String,IO}, [sequence_type::Type = String])
This function parses a whole FASTA file at once and stores it in memory. The result is a Vector{Any} whose elements are (description, sequence) tuples, where description is a String and sequence contains the sequence data, stored in a container type defined by the optional sequence_type argument (see The sequence storage type section for more information).
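A short usage sketch, assuming a small placeholder file written on the spot (the path and gene names are made up for the example):

```julia
using FastaIO

# Placeholder file in a temporary directory.
path = joinpath(mktempdir(), "example.fasta")
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

# Load the whole file into memory as (description, sequence) tuples.
entries = readfasta(path)
lengths = Dict(desc => length(seq) for (desc, seq) in entries)

# Same data, with the sequences stored as byte vectors.
byte_entries = readfasta(path, Vector{UInt8})
```

This is convenient for small files. For large ones, iterating a FastaReader avoids holding every entry in memory.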
FastaIO.FastaReader — Method
FastaReader{T}(file::Union{AbstractString,IO})
This creates an object which is able to parse FASTA files, one entry at a time. file can be a plain text file or a gzip-compressed file (it will be autodetected from the content). The type T determines the output type of the sequences (see The sequence storage type section for more information) and it defaults to String.
The data can be read out by iterating the FastaReader object:
for (name, seq) in FastaReader("somefile.fasta")
    # do something with name and seq
end
As shown, the iterator returns a tuple containing the description (always a String) and the data, whose type is set when creating the FastaReader object (e.g. FastaReader{Vector{UInt8}}(filename)).
The FastaReader type has a field num_parsed which contains the number of entries parsed so far.
Other ways to read out the data are via the readentry and readfasta functions.
FastaIO.FastaReader — Method
FastaReader(f::Function, filename::AbstractString, [sequence_type::Type = String])
This form of the constructor is useful for do-notation, i.e.:
FastaReader(filename) do fr
    # read out the data from fr, e.g.
    for (name, seq) in fr
        # do something with name and seq
    end
end
which ensures that the close function is called and is thus recommended (otherwise the file is closed only when the garbage collector finalizes the FastaReader object after it goes out of scope).
FastaIO.readentry — Function
readentry(fr::FastaReader)
This function can be used to read entries one at a time:
fr = FastaReader("somefile.fasta")
name, seq = readentry(fr)
See also the eof function.
FastaIO.rewind — Method
rewind(fr::FastaReader)
This function rewinds the reader, so that it can restart parsing from the beginning without closing and re-opening the file. It also resets the value of the num_parsed field.
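A two-pass read is a typical use: count the entries first, then rewind and process them. The sketch below assumes a small placeholder file, and the wrapper function two_pass is hypothetical, introduced only to keep the example self-contained.

```julia
using FastaIO

# Placeholder file in a temporary directory.
path = joinpath(mktempdir(), "example.fasta")
writefasta(path, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])

# Hypothetical wrapper: first pass counts entries, second pass starts over.
function two_pass(path)
    fr = FastaReader(path)
    try
        total = 0
        for _ in fr                  # first pass: consume every entry
            total += 1
        end
        parsed_before = fr.num_parsed
        rewind(fr)                   # back to the start, counter reset
        parsed_after = fr.num_parsed
        first_entry = readentry(fr)  # second pass begins at the first entry
        return total, parsed_before, parsed_after, first_entry
    finally
        close(fr)
    end
end

result = two_pass(path)
```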
Base.eof — Method
eof(fr::FastaReader)
This function extends Base.eof and tests for the end-of-file condition; it is useful when using readentry:
fr = FastaReader("somefile.fasta")
while !eof(fr)
    name, seq = readentry(fr)
    # do something
end
close(fr)
Base.close — Method
close(fr::FastaReader)
This function extends Base.close and closes the stream associated with the FastaReader; the reader must not be used any more after this function is called.
Writing files
FastaIO.writefasta — Method
writefasta(filename::String, data, [mode::String = "w"]; check_description=true)
This function dumps data to a FASTA file, auto-formatting it so that it follows the specifications detailed in the section titled The FASTA format. The data can be anything which is iterable and which produces (description, sequence) tuples upon iteration, where the description must be convertible to a String and the sequence can be any iterable object which yields elements convertible to ASCII characters (e.g. a String, a Vector{UInt8} etc.).
Examples:
writefasta("somefile.fasta", [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])
writefasta("somefile.fasta", ["GENE1" => "GCATT", "GENE2" => "ATTAGC"])
If the filename ends with .gz, the result will be a gzip-compressed file.
The mode flag determines how the file is opened; use "a" to append the data to an existing file.
Set the keyword check_description=false to disable the warning message given when description lines are too long.
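The mode flag and the .gz suffix can be combined with the calls above as in the following sketch. The paths are placeholders in a temporary directory, and reading the files back with readfasta is only there to show the effect.

```julia
using FastaIO

# Placeholder paths in a temporary directory.
path = joinpath(mktempdir(), "example.fasta")
gz_path = path * ".gz"

writefasta(path, [("GENE1", "GCATT")])          # default mode "w": create or overwrite
writefasta(path, [("GENE2", "ATTAGC")], "a")    # mode "a": append to the existing file

writefasta(gz_path, ["GENE1" => "GCATT"])       # .gz suffix: gzip-compressed output

plain = readfasta(path)
compressed = readfasta(gz_path)                 # compression is autodetected on reading
```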
FastaIO.writefasta — Method
writefasta([io::IO = stdout], data; check_description=true)
This version of the function writes to an already opened IO stream, defaulting to stdout.
Set the keyword check_description=false to disable the warning message given when description lines are too long.
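Writing to a stream is handy for building FASTA text in memory, for example in tests. The sketch below uses an IOBuffer from Base as the target, and the entries are made up for the example.

```julia
using FastaIO

# Collect the formatted output in an in-memory buffer instead of a file.
io = IOBuffer()
writefasta(io, [("GENE1", "GCATT"), ("GENE2", "ATTAGC")])
text = String(take!(io))

print(text)
```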
FastaIO.FastaWriter — Type
FastaWriter(filename::AbstractString, [mode::String = "w"])
FastaWriter([io::IO = stdout])
FastaWriter(f::Function, args...)
This creates an object which is able to write formatted FASTA files which conform to the specifications detailed in the section titled The FASTA format, via the write and writeentry functions.
The third form allows the use of do-notation:
FastaWriter("somefile.fasta") do fw
    # write the file
end
which is strongly recommended since it ensures that the close function is called at the end of writing. This is crucial, as failing to do so may result in incomplete files (closing is also done by the finalizer, so it will still happen automatically if the FastaWriter object goes out of scope and is garbage-collected, but there is no guarantee that this will happen if Julia exits).
If the filename ends with .gz, the result will be gzip-compressed.
The mode flag can be used to set the opening mode of the file; use "a" to append to an existing file.
The FastaWriter object has an entry::Int field which stores the number of the entry which is currently being written.
After creating the object, you can set the check_description field to false to disable the warning given when description lines are too long.
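For instance, the warning can be switched off right after the writer is created, as in this sketch. The path and the artificially long description are placeholders for the example.

```julia
using FastaIO

# Placeholder path and a description longer than 80 characters.
path = joinpath(mktempdir(), "long_description.fasta")
long_desc = "GENE1_" * "x"^100

FastaWriter(path) do fw
    fw.check_description = false      # silence the long-description warning
    writeentry(fw, long_desc, "GCATT")
end

entries = readfasta(path)
```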
FastaIO.writeentry — Function
writeentry(fw::FastaWriter, description::AbstractString, sequence)
This function writes one entry to the FASTA file, following the specifications detailed in the section titled The FASTA format. The description must be given without the initial '>' character. The sequence can be any iterable object whose elements are convertible to ASCII characters.
Example:
FastaWriter("somefile.fasta") do fw
    for (desc, seq) in [("GENE1", "GCATT"), ("GENE2", "ATTAGC")]
        writeentry(fw, desc, seq)
    end
end
Base.write — Method
write(fw::FastaWriter, item)
This function extends Base.write and streams items to a FASTA file, which will be formatted according to the specifications detailed in the section titled The FASTA format.
When using this method, description lines are marked by the fact that they begin with a '>' character; anything else is assumed to be part of the sequence data.
If item is a Vector, write will be called iteratively over it; if it is a String, a newline will be appended to it and it will be written out. For example, the following code:
FastaWriter("somefile.fasta") do fw
    for s in [">GENE1", "GCA", "TTT", ">GENE2", "ATTAGC"]
        write(fw, s)
    end
end
will result in the file:
>GENE1
GCATTT
>GENE2
ATTAGC
If item is neither a Vector nor a String, it must be convertible to an ASCII character, and it will be piped into the file. For example, the following code:
data = """
>GENE1
GCA
TTT
>GENE2
ATT
AGC
"""
FastaWriter("somefile.fasta") do fw
    for ch in data
        write(fw, ch)
    end
end
will result in the same file as above.
Base.close — Method
close(fw::FastaWriter)
This function extends Base.close and should always be called explicitly to finalize the FastaWriter once writing has finished, unless do-notation was used when creating it.