Title: | Fast routines for hanlding large ESRI shapefiles (.shp) |
---|---|
Description: | Routines for handling of large ESRI shapefiles (.shp). This includes reading, thinning of points and matching of points to containing shapes. The main aim for this package is to provide the speed to support large shapefiles (millions of points). It is several orders of maginute faster than some other shapefile packages. |
Authors: | Simon Urbanek <[email protected]> |
Maintainer: | Simon Urbanek <[email protected]> |
License: | GPL-2 |
Version: | 0.1-2 |
Built: | 2024-12-01 05:10:01 UTC |
Source: | https://github.com/s-u/fastshp |
as.shp
coerces lists of polygons to adhere to the
shp
representation.
as.shp(x)
as.shp(x)
x |
object to coerce |
Objects of the class "shp"
are lists of shapes. Each shape
comprises of polygons and has an identifier and a bounding box.
The as.shp
method coerces various representations of shape sets
into "shp"
objects.
Currently the only supported representation to convert from is a list
of "x"
, "y"
pairs, e.g.: list(list(x=..., y=...),
list(x=..., y=...), ...)
where each element of the list defines a
shape. The shapes will be assigned sequential identifiers starting at
1.
Object of the class "shp"
.
Simon Urbanek
centr
computes centroids and areas of shapes
centr(shp)
centr(shp)
shp |
shapefile object as returned by |
Data frame with columns:
cx |
x coordinate of the centroid |
cy |
y coordinate of the centroid |
area |
area of the shape |
If a shape has more than one part, only the centroid and area of the first part is computed, subsequent parts are discarded with a warning.
inside
returns indices of shapes that contain given points
inside(shp, x, y, clockwise = TRUE, all = FALSE)
inside(shp, x, y, clockwise = TRUE, all = FALSE)
shp |
shape object (as returned by |
x |
x coordinates of the points to match |
y |
y coordinates of the points to match |
clockwise |
logical: if |
all |
logical: if |
The matching uses bounding box as a first approximation and then
winding rule (due east) to determine whether the point is inside. If
more than one shape matches, the index of the first matching shape is
returned unless all=TRUE
is set in which case each entry is
a list of all matches. Points exactly on the boundary (as far as
possible by floating point arithmetics) are not considered
inside. Note that the shape cooridnates must be in R polygon format
(format="polygon"
in read.shp
) or have just one part,
otherwise parts will not be treated properly.
If all=FALSE
:
Integer vector of the same length as the number of points, each value
is either an index of the first matching shape or NA
if the
point is not inside of any shapes.
If all=TRUE
:
List of integer vectors with the indices of matching shapes (which
whill be empty if there is no match). There will be as many elements
in the list as there are points.
Holes are currently not considered by this method. There is a slower implementation using CGAL that treats holes properly.
Simon Urbanek
merges.tiles
merges adjacent polygons
merge.tiles(x, y, id = rep(1L, length(x)))
merge.tiles(x, y, id = rep(1L, length(x)))
x |
x coordinates |
y |
y coordinates |
id |
identifiers defining tiles - all contiguous coordinates with
the same |
Each tile is a polygon defined by points with the id
of the
tile. The points for the same id
must be contiguous.
No tiles are allowed to overlap. Shared edges must be defined by
points common to both tiles. The orientation must be consistent since
only edges with opposite orientation will be merged.
The algorithm used to merge finds all common points in the data, for each common point checks whether edges from that point are shared and any such edges are marked for deletion.
The resulting set of id
s will be a subset of the initial
id
s. Merged tiles recevied the id
that comes earlier in
the input id
vector.
List:
x |
x coordinates |
y |
y coordinates |
id |
identifiers defining the resulting tiles |
Simon Urbanek
plot.shp
plots a list of shapes
plot.shp(x, xlim, ylim, asp = 1/cos(sum(ylim)/360 * pi), add = FALSE, axes = FALSE, full = TRUE, hold = FALSE, col = "#e0e0e0", border = "#808080", ...)
plot.shp(x, xlim, ylim, asp = 1/cos(sum(ylim)/360 * pi), add = FALSE, axes = FALSE, full = TRUE, hold = FALSE, col = "#e0e0e0", border = "#808080", ...)
x |
|
xlim |
range to plot in |
ylim |
range to plot in |
asp |
aspect ratio; the default ensures least distortion in the center of the image |
add |
logical; if |
axes |
logical; passed to |
.
full |
logical; if |
hold |
logical; if |
col |
colors for the shapes; each shape (possibly consisting of multiple polygons) consumes one element |
border |
borders for the shapes; consumer and recycled jsut like
|
... |
additional arguments |
NULL
invisibly
It is most efficient to plot the result of read.shp(...,
format="list")
. All other types are converted into that type before
plotting. It is most inefficient to use format="table"
.
Simon Urbanek
# Census 2010 TIGER/Line(TM) state shapefile fn <- system.file("shp", "tl_2010_us_state10.shp.xz", package="fastshp") # contrary to the advice above we use the table format, because # it is a huge file with many points, so we use constrained thinning # which works on tables s <- read.shp(xzfile(fn, "rb"), "table") # substantially reduce the number of points s <- s[thin.shp(s, 0.01),] # focus on continental US only q <- list(x=c(-127.35, -65), y = c(51.5, 22.23)) plot.shp(s, q$x, q$y)
# Census 2010 TIGER/Line(TM) state shapefile fn <- system.file("shp", "tl_2010_us_state10.shp.xz", package="fastshp") # contrary to the advice above we use the table format, because # it is a huge file with many points, so we use constrained thinning # which works on tables s <- read.shp(xzfile(fn, "rb"), "table") # substantially reduce the number of points s <- s[thin.shp(s, 0.01),] # focus on continental US only q <- list(x=c(-127.35, -65), y = c(51.5, 22.23)) plot.shp(s, q$x, q$y)
read.shp
reads ESRI shapefile format (.shp
). Currently
only polygons and polylines are supported.
read.shp(where, format = c("list", "pairlist", "polygon", "table"), close = TRUE)
read.shp(where, format = c("list", "pairlist", "polygon", "table"), close = TRUE)
where |
filename to read the data from or a binary connection or a raw vector with the data content |
format |
output format (see below for details), defaults to "list". |
close |
if |
The result depends on the value of the format
argument:
"list" |
list (generic vector) of shapes exactly as represented
in the
|
"pairlist" |
same as "list" except that the list of shapes is stored in a pairlist and not a list. This is the most memory-efficient way of reading a shapefile, because and all other formats are derived from first reading this format. Pairlists are good for linear scans but inefficient for indexing. |
"polygon" |
same as "list" except that coordinates are
represented in R polygon format (parts are separated by |
"table" |
The result is a data frame with the following columns:
|
Although other packages exist for reading shapefiles, the focus of this implementation is speed, so it works on very large compilations of shapes, such as the Tiger database which is impossible to load using naive R implementations.
Simon Urbanek
# Census 2010 TIGER/Line(TM) state shapefile fn <- system.file("shp", "tl_2010_us_state10.shp.xz", package="fastshp") s <- read.shp(xzfile(fn, "rb"))
# Census 2010 TIGER/Line(TM) state shapefile fn <- system.file("shp", "tl_2010_us_state10.shp.xz", package="fastshp") s <- read.shp(xzfile(fn, "rb"))
thin
thins out a polyline/polygon by removing points that
are deemed to have no visual effect under the given tolerance.
thin(x, y, tolerance = 1e-4, lock = NULL, method = 2L, id = NULL) thin.shp(shp, tolerance = 1e-3, max.width = 5L, all = !is.data.frame(shp))
thin(x, y, tolerance = 1e-4, lock = NULL, method = 2L, id = NULL) thin.shp(shp, tolerance = 1e-3, max.width = 5L, all = !is.data.frame(shp))
x |
x coordinates of the points |
y |
y coordinates of the points |
tolerance |
maximum allowable distance for a point to be removed |
lock |
defines points that cannot be removed. Can be
|
method |
Must be one of |
id |
optional index |
shp |
shape object as returned by
|
max.width |
the maximum number of shapes that a single point can belong to. It determines the size of the adjacency table created in the process. |
all |
determines whether the thinning information is included in
the shape object and returned as the whole object ( |
thin
performs thinning of one or more polygons defined by
coordinates x
and y
, where polygons are separated by
NA
.
The default algorithm used here is very simple and fast: it performs a
linear scan through all points and for each convex point it measures
the distance of the point from a line connecting last unthinned point
and the subsequent point. If this distance is below tolerance it is
removed. Note that the x, y space must be Euclidean so coordinates
may need to be transformed accordingly (i.e. typically you don't want
to use uncorrected lat/lon!). This fast algorithm guarantees only
n * tolerance
accuracy with n
being the number of
subsequently removed points. The extra error will be more noticeable
for subsequent slowly drifting points.
The alternative algorithm (method = 2L
) additionally checks
whether any of the previously removed points would be out of tolerance
as well - this adds complexity (it is quardatic in the number of
removed points), but guarantees that the result is never further than
tolerance
away from the original shape.
The input x
, y
can contain multiple segments separated
by NA
(R polygon format). Segments are always assumed to be a
loop (you can still use keep
to force both ends to be
non-removable).
thin.shp
performs a constrained thinning (eventually using
thin
) whereby segments that are shared by two or more polygons
are guaranteed to be shared even after thinning. This is done by
computing the index from each shared point to all the same points,
then comparing running segments that have the same shape id
list in that index and referenciong only the the first set of points
so that the thinnig of those will be used for all subsequent segments
of the same point sequence. In addition, all points as which the shape
id
list changes are declared as fixed points that cannot be
removed.
All points are compared by their actual coordinate value, no fudge factor is applied, so the source is assumed to be consistent.
thin
: logical vector of the same length as the number of points
with TRUE
for points that are kept and FALSE
for removed
points.
this.shp
: same as thin
if all = FALSE
, otherwise
the shp
shape obejct is augmented with thin
element
which contains the result of thin
and the object itself is
returned.
Simon Urbanek
# load 2010 Census TIGER/Line(TM) state data (if included) shp <- system.file("shp","tl_2010_us_state10.shp.xz",package="fastshp") if (nzchar(shp)) { s <- read.shp(xzfile(shp, "rb"), "pol") # thin on a cylindrical projection (around ca. 37 deg lat) t <- lapply(s, function(o) thin(o$x / 1.25, o$y, 1e-3, method = 1L)) par(mar = rep(0, 4)) plot(c(-125,-67), c(25, 49.4), asp=1.25, ty='n', axes=FALSE) for (i in seq.int(s)) polygon(s[[i]]$x[t[[i]]], s[[i]]$y[t[[i]]], col="#eeeeee") cat(" reduction: ", 100 - sum(sapply(t, sum)) / sum(sapply(t, length)) * 100, "%\n", sep='') # use the more conservative algorithm t <- lapply(s, function(o) thin(o$x / 1.25, o$y, 1e-3, method = 2L)) cat(" reduction: ", 100 - sum(sapply(t, sum)) / sum(sapply(t, length)) * 100, "%\n", sep='') # use constrained thinning: st <- read.shp(xzfile(shp, "rb"), "table") st$x <- st$x / 1.25 a <- thin.shp(st, 1e-3) cat(" reduction: ", 100 - sum(a) / length(a) * 100, "%\n", sep='') par(mfrow=c(1, 2)) # compare unconstrained and constrained thinning up close (NY/NJ area) plot(0, 0, xlim=c(-74.22, -74.15), ylim=c(40.55,40.67), asp=1.25, axes=FALSE) for (i in seq.int(s)) polygon(s[[i]]$x[t[[i]]], s[[i]]$y[t[[i]]], col=c("#0000ff80","#80800080")[i %% 2 + 1L], border=1) plot(0, 0, xlim=c(-74.22, -74.15), ylim=c(40.55,40.67), asp=1.25, axes=FALSE) for (i in unique(st$id)) polygon(st$x[st$id==i]*1.25, st$y[st$id==i], col=c("#0000ff80","#80800080")[i %% 2 + 1L], border=1) }
# load 2010 Census TIGER/Line(TM) state data (if included) shp <- system.file("shp","tl_2010_us_state10.shp.xz",package="fastshp") if (nzchar(shp)) { s <- read.shp(xzfile(shp, "rb"), "pol") # thin on a cylindrical projection (around ca. 37 deg lat) t <- lapply(s, function(o) thin(o$x / 1.25, o$y, 1e-3, method = 1L)) par(mar = rep(0, 4)) plot(c(-125,-67), c(25, 49.4), asp=1.25, ty='n', axes=FALSE) for (i in seq.int(s)) polygon(s[[i]]$x[t[[i]]], s[[i]]$y[t[[i]]], col="#eeeeee") cat(" reduction: ", 100 - sum(sapply(t, sum)) / sum(sapply(t, length)) * 100, "%\n", sep='') # use the more conservative algorithm t <- lapply(s, function(o) thin(o$x / 1.25, o$y, 1e-3, method = 2L)) cat(" reduction: ", 100 - sum(sapply(t, sum)) / sum(sapply(t, length)) * 100, "%\n", sep='') # use constrained thinning: st <- read.shp(xzfile(shp, "rb"), "table") st$x <- st$x / 1.25 a <- thin.shp(st, 1e-3) cat(" reduction: ", 100 - sum(a) / length(a) * 100, "%\n", sep='') par(mfrow=c(1, 2)) # compare unconstrained and constrained thinning up close (NY/NJ area) plot(0, 0, xlim=c(-74.22, -74.15), ylim=c(40.55,40.67), asp=1.25, axes=FALSE) for (i in seq.int(s)) polygon(s[[i]]$x[t[[i]]], s[[i]]$y[t[[i]]], col=c("#0000ff80","#80800080")[i %% 2 + 1L], border=1) plot(0, 0, xlim=c(-74.22, -74.15), ylim=c(40.55,40.67), asp=1.25, axes=FALSE) for (i in unique(st$id)) polygon(st$x[st$id==i]*1.25, st$y[st$id==i], col=c("#0000ff80","#80800080")[i %% 2 + 1L], border=1) }