Title: | OPUS Miner Algorithm for Filtered Top-k Association Discovery |
---|---|
Description: | Provides a simple R interface to the OPUS Miner algorithm (implemented in C++) for finding the top-k productive, non-redundant itemsets from transaction data. The OPUS Miner algorithm uses the OPUS search algorithm to efficiently discover the key associations in transaction data, in the form of self-sufficient itemsets, using either leverage or lift. See <http://i.giwebb.com/index.php/research/association-discovery/> for more information in relation to the OPUS Miner algorithm. |
Authors: | Geoffrey I Webb [aut, cph] (OPUS Miner algorithm and C++ implementation, http://i.giwebb.com/index.php/research/association-discovery/), Christoph Bergmeir [ctb, cre], Angus Dempster [ctb, cph] (R interface) |
Maintainer: | Christoph Bergmeir <[email protected]> |
License: | GPL-3 |
Version: | 0.1-1 |
Built: | 2024-11-05 04:26:51 UTC |
Source: | https://github.com/cran/opusminer |
The opusminer
package provides an R interface to the OPUS Miner
algorithm (implemented in C++), developed by Professor Geoffrey I Webb, for
finding the top k, non-redundant itemsets on the measure of interest.
OPUS Miner is a branch-and-bound algorithm for efficient discovery of self-sufficient itemsets. For a user-specified k and interest measure, OPUS Miner finds the top k productive non-redundant itemsets with respect to the specified measure. It is then straightforward to filter out those that are not independently productive with respect to that set, resulting in a set of self-sufficient itemsets.
OPUS Miner is based on the OPUS search algorithm. OPUS is a set enumeration algorithm distinguished by a computationally efficient pruning mechanism that ensures that whenever an item is pruned, it is removed from the entire search space below the parent node.
OPUS Miner systematically traverses viable regions of the search space (using depth-first search), maintaining a collection of the top k productive non-redundant itemsets in the search space explored. When all of the viable regions have been explored, the top k productive non-redundant itemsets in the search space explored must be the top k for the entire search space.
A comprehensive explanation of the algorithm is provided in the article cited below.
Webb, G. I., & Vreeken, J. (2014). Efficient Discovery of the Most Interesting Associations. ACM Transactions on Knowledge Discovery from Data, 8(3), 1-15. doi: http://dx.doi.org/10.1145/2601433
opus
finds the top k productive, non-redundant itemsets on the
measure of interest (leverage or lift) using the OPUS Miner algorithm.
opus(transactions, k = 100, format = "data.frame", sep = " ", print_closures = FALSE, filter_itemsets = TRUE, search_by_lift = FALSE, correct_for_mult_compare = TRUE, redundancy_tests = TRUE)
opus(transactions, k = 100, format = "data.frame", sep = " ", print_closures = FALSE, filter_itemsets = TRUE, search_by_lift = FALSE, correct_for_mult_compare = TRUE, redundancy_tests = TRUE)
transactions |
A filename, list, or object of class
|
k |
The number of itemsets to return, an integer (default 100). |
format |
The output format ("data.frame", default, or "itemsets"). |
sep |
The separator between items (for files, default " "). |
print_closures |
return the closure for each itemset (default |
filter_itemsets |
filter itemsets that are not independently productive (default |
search_by_lift |
make lift (rather than leverage) the measure of interest (default |
correct_for_mult_compare |
correct alpha for the size of the search space (default |
redundancy_tests |
exclude redundant itemsets (default |
opus
provides an interface to the OPUS Miner algorithm (implemented in
C++) to find the top k productive, non-redundant itemsets by leverage
(default) or lift.
transactions
should be a filename, list (of transactions, each list
element being a vector of character values representing item labels), or an
object of class transactions
(arules
).
Files should be in the format of a list of transactions, one line per
transaction, each transaction (ie, line) being a sequence of item labels,
separated by the character specified by the parameter sep
(default "
"). See, for example, the files at http://fimi.ua.ac.be/data/.
(Alternatively, files can be read seaparately using the
read_transactions
function.)
format
should be specified as either "data.frame" (the default) or
"itemsets", and any other value will return a list.
The top k productive, non-redundant itemsets, with relevant
statistics, in the form of a data frame, object of class
itemsets
(arules
), or a list.
Webb, G. I., & Vreeken, J. (2014). Efficient Discovery of the Most Interesting Associations. ACM Transactions on Knowledge Discovery from Data, 8(3), 1-15. doi: http://dx.doi.org/10.1145/2601433
## Not run: result <- opus("mushroom.dat") result <- opus("mushroom.dat", k = 50) result[result$self_sufficient, ] result[order(result$count, decreasing = TRUE), ] trans <- read_transactions("mushroom.dat", format = "transactions") result <- opus(trans, print_closures = TRUE) result <- opus(trans, format = "itemsets") ## End(Not run)
## Not run: result <- opus("mushroom.dat") result <- opus("mushroom.dat", k = 50) result[result$self_sufficient, ] result[order(result$count, decreasing = TRUE), ] trans <- read_transactions("mushroom.dat", format = "transactions") result <- opus(trans, print_closures = TRUE) result <- opus(trans, format = "itemsets") ## End(Not run)
read_transactions
reads transaction data from a file fast, providing a
significant speed increase over alternative methods for larger files.
read_transactions(filename, sep = " ", format = "list")
read_transactions(filename, sep = " ", format = "list")
filename |
A filename. |
sep |
The separator between items (default " "). |
format |
The output format ("list" or "transactions"). |
read_transactions
uses (internally) the readChar
function to read transaction data from a file fast. This is substantially
faster for larger files than alternative methods.
Files should be in the format of a list of transactions, one line per
transaction, each transaction (ie, line) being a sequence of item labels,
separated by the character specified by the parameter sep
(default "
"). See, for example, the files at http://fimi.ua.ac.be/data/.
The transaction data, in the form of a list (of transactions, each
list element being a vector of character values representing item labels), or
an object of class transactions
(arules
).
## Not run: trans <- read_transactions("mushroom.dat") trans <- read_transactions("mushroom.dat", format = "transactions") ## End(Not run)
## Not run: trans <- read_transactions("mushroom.dat") trans <- read_transactions("mushroom.dat", format = "transactions") ## End(Not run)