Ziggy release 0.7.0 (2024-10-25)

This commit is contained in:
Bill Wohler 2024-10-30 15:16:18 -07:00
parent 78729cbe38
commit ef06239f5c
378 changed files with 10859 additions and 12188 deletions

View File

@ -1,8 +1,8 @@
cff-version: 1.2.0
message: "If you use this software in your research, please cite it using these metadata."
title: Ziggy
version: v0.6.0
date-released: "2024-07-26"
version: v0.7.0
date-released: "2024-10-25"
abstract: "Ziggy, a portable, scalable infrastructure for science data processing pipelines, is the child of the Transiting Exoplanet Survey Satellite (TESS) pipeline and the grandchild of the Kepler Pipeline."
authors:
- family-names: Tenenbaum

View File

@ -4,6 +4,49 @@
These are the release notes for Ziggy. In the change log below, we'll refer to our internal Jira key for our benefit. If the item is associated with a resolved GitHub issue or pull request, we'll add a link to that. Changes that are incompatible with previous versions are marked below. While the major version is 0, we will be making breaking changes when bumping the minor number. However, once we hit version 1.0.0, incompatible changes will only be made when bumping the major number.
# v0.7.0: Halloween release
This release is coming out just before Halloween, and it's full of tricks and treats. Behind the scenes, we continued to buy down decades of technical debt. Are we finally getting close to paying off that loan? By eliminating the StateFile API (ZIGGY-465) and fixing ZIGGY-432, ZIGGY-454, and ZIGGY-478, the pipeline no longer stalls or crashes for mysterious reasons. We fixed a few UI annoyances like collapsing tree controls and Control-Click not working as expected on the Mac.
## New Features
1. Rename sub-task to subtask (ZIGGY-79)
1. Use log4j2 conventions and features (ZIGGY-82)
1. Provide means for clients/algorithms to add their software version to Ziggy's data accountability (ZIGGY-430)
1. Fix Group design (ZIGGY-431)
1. Refactor PipelineTask (ZIGGY-433)
1. Limit console to operations (ZIGGY-441)
1. Add PipelineInstanceIdOperations methods to PipelineInstanceOperations (ZIGGY-445)
1. Support copying files to task directory without datastore (ZIGGY-446)
1. Retrieve DatastoreRegexps from the database by name (ZIGGY-447)
1. Locate consumed files used to produce a file (ZIGGY-448)
1. Ensure importer can add and update module and pipeline definitions (ZIGGY-452)
1. Write HDF5 files usable by Zowie (ZIGGY-455)
1. Eliminate the need for programmatic appenders (ZIGGY-456)
1. TaskMonitor doesn't change processing step from QUEUED to EXECUTING (ZIGGY-457)
1. Add parameter retrieval to PipelineTaskOperations (ZIGGY-460)
1. Check for new vs existing files in datastore (ZIGGY-461)
1. Eliminate StateFile API (ZIGGY-465)
## Bug Fixes
1. Double-click resize is lost when table auto-update occurs (ZIGGY-297)
1. Collapsing Parameter Library and Pipelines tree controls (ZIGGY-360)
1. Can't halt SUBMITTED tasks (ZIGGY-424)
1. Resume monitoring can't be stopped (ZIGGY-425)
1. Race condition in pipeline workers (ZIGGY-432)
1. Ziggy C++ Mex build tools set incorrect install name (ZIGGY-444)
1. Warning alert clears error alert status (ZIGGY-450)
1. Python distutils module removed from Python 3.12 (ZIGGY-451)
1. Local processing crashes sporadically (ZIGGY-454)
1. ZiggyQuery chunkedIn doesn't work (ZIGGY-462)
1. Remote execution dialog can't parse numbers with commas (ZIGGY-463)
1. Parameter API populates empty arrays (ZIGGY-468)
1. Module parameter sets in HDF5 have incorrect field order values (ZIGGY-469)
1. Worker never exits when subtask errors (ZIGGY-478)
1. Control-Click clears selection on the Mac (ZIGGY-479)
1. Exceptions when using pipeline instance filters (ZIGGY-489)
# v0.6.0: You never have to wonder what Ae 4 / 3 / 0 means again
We fixed a confusing aspect of the user interface and a ton of bugs while we continued to buy down decades of technical debt. You can now halt tasks or instances from the command-line interface (CLI). We improved pipeline definitions by making datastore definitions more flexible and providing for user-specified data receipt unit of work (UOW) generators.

View File

@ -124,64 +124,6 @@ java {
withSourcesJar()
}
test {
systemProperty "java.library.path", "$outsideDir/lib"
maxHeapSize = "1024m"
testLogging {
events "failed", "skipped"
}
useJUnit {
// If a code coverage report that includes the integration tests is desired, then comment
// out the IntegrationTestCategory line and uncomment the RunByNameTestCategory line. When
// the JaCoCo issue described below is resolved, then delete this comment.
// excludeCategories 'gov.nasa.ziggy.RunByNameTestCategory'
excludeCategories 'gov.nasa.ziggy.IntegrationTestCategory'
}
// Use "gradle -P traceTests test" to show test order.
if (project.hasProperty("traceTests")) {
afterTest { desc, result ->
logger.quiet "${desc.className}.${desc.name}: ${result.resultType}"
}
}
}
// Execute tests marked with @Category(IntegrationTestCategory.class).
task integrationTest(type: Test) {
systemProperty "log4j2.configurationFile", "$rootDir/etc/log4j2.xml"
systemProperty "ziggy.logfile", "$buildDir/build.log"
systemProperty "java.library.path", "$outsideDir/lib"
testLogging {
events "failed", "skipped"
}
useJUnit {
includeCategories 'gov.nasa.ziggy.IntegrationTestCategory'
excludeCategories 'gov.nasa.ziggy.RunByNameTestCategory'
}
}
// Execute tests marked with @Category(RunByNameTestCategory.class).
// These tests are typically run explicitly with the --tests option
// since they don't play well with others. For example:
// gradle runByNameTests --tests *RmiInterProcessCommunicationTest
task runByNameTest(type: Test) {
systemProperty "log4j2.configurationFile", "$rootDir/etc/log4j2.xml"
systemProperty "ziggy.logfile", "$buildDir/build.log"
systemProperty "java.library.path", "$outsideDir/lib"
useJUnit {
includeCategories 'gov.nasa.ziggy.RunByNameTestCategory'
}
}
// Task specified by the Ziggy Software Management Plan (SMP) to run all tests.
task testAll
testAll.dependsOn test, integrationTest
check.dependsOn testAll
// To view code coverage, run the jacocoTestReport task and view the output in:
// build/reports/jacoco/test/html/index.html.
check.dependsOn jacocoTestReport
@ -234,13 +176,18 @@ tasks.withType(com.github.spotbugs.snom.SpotBugsTask) {
task copyOutsideLibs
compileJava.dependsOn copyOutsideLibs
// Apply Ziggy Gradle script plugins.
// A couple of the other scripts depend on integrationTest.
apply from: "script-plugins/test.gradle"
// Most scripts in alphabetical order.
apply from: "script-plugins/copy.gradle"
apply from: "script-plugins/database-schemas.gradle"
apply from: "script-plugins/eclipse.gradle"
apply from: "script-plugins/hdf5.gradle"
apply from: "script-plugins/misc.gradle"
apply from: "script-plugins/test.gradle"
apply from: "script-plugins/wrapper.gradle"
apply from: "script-plugins/xml-schemas.gradle"
apply from: "script-plugins/ziggy-libraries.gradle"

View File

@ -124,7 +124,7 @@ public class Mcc extends TessExecTask {
@TaskAction
public void action() {
log.info(String.format("%s.action()\n", this.getClass().getSimpleName()));
log.info("{}.action()", this.getClass().getSimpleName());
File matlabHome = matlabHome();
File buildBinDir = new File(getProject().getBuildDir(), "bin");
@ -150,9 +150,9 @@ public class Mcc extends TessExecTask {
path += ".app";
executable = new File(path);
String message = String.format(
"The outputExecutable, \"%s\", already exists and cannot be deleted\n", executable);
"The outputExecutable, %s, already exists and cannot be deleted\n", executable);
if (executable.exists()) {
log.info(String.format("%s: already exists, delete it\n", executable));
log.info("{} already exists, delete it", executable);
if (executable.isDirectory()) {
try {
FileUtils.deleteDirectory(executable);
@ -198,7 +198,7 @@ public class Mcc extends TessExecTask {
processBuilder.environment()
.put("MCC_DIR", getProject().getProjectDir().getCanonicalPath());
} catch (IOException e) {
log.error(String.format("Could not set MCC_DIR: %s", e.getMessage()), e);
log.error("Could not set MCC_DIR: {}", e.getMessage(), e);
}
execProcess(processBuilder);

View File

@ -297,7 +297,7 @@ public class ZiggyCppMex extends DefaultTask {
Project project = getProject();
if (project.hasProperty(MATLAB_PATH_PROJECT_PROPERTY)) {
matlabPath = project.findProperty(MATLAB_PATH_PROJECT_PROPERTY).toString();
log.info("MATLAB path set from project extra property: " + matlabPath);
log.info("MATLAB path set from project extra property {}", matlabPath);
}
if (matlabPath == null) {
String systemPath = System.getenv("PATH");
@ -307,7 +307,7 @@ public class ZiggyCppMex extends DefaultTask {
String pathLower = path.toLowerCase();
if (pathLower.contains("matlab") && path.endsWith("bin")) {
matlabPath = path.substring(0, path.length() - 4);
log.info("MATLAB path set from PATH environment variable: " + matlabPath);
log.info("MATLAB path set from PATH environment variable {}", matlabPath);
break;
}
}
@ -317,7 +317,7 @@ public class ZiggyCppMex extends DefaultTask {
String matlabHome = System.getenv(MATLAB_PATH_ENV_VAR);
if (matlabHome != null) {
matlabPath = matlabHome;
log.info("MATLAB path set from MATLAB_HOME environment variable: " + matlabPath);
log.info("MATLAB path set from MATLAB_HOME environment variable {}", matlabPath);
}
}
if (matlabPath == null) {

View File

@ -256,7 +256,7 @@ public class ZiggyCppMexPojo extends ZiggyCppPojo {
@Override
public void action() {
log.info(String.format("%s.action()\n", this.getClass().getSimpleName()));
log.info("{}.action()", this.getClass().getSimpleName());
// Start by performing the compilation
compileAction();

View File

@ -240,8 +240,8 @@ public class ZiggyCppPojo {
fileListBuilder.append(file.getName());
fileListBuilder.append(" ");
}
log.info("List of C/C++ files in directory " + sourceFilePaths + ": "
+ fileListBuilder.toString());
log.info("List of C/C++ files in directory {}: {}", sourceFilePaths,
fileListBuilder.toString());
}
}
@ -490,7 +490,7 @@ public class ZiggyCppPojo {
*/
public void action() {
log.info(String.format("%s.action()\n", this.getClass().getSimpleName()));
log.info("{}.action()", this.getClass().getSimpleName());
// compile the source files
compileAction();
@ -505,7 +505,7 @@ public class ZiggyCppPojo {
File objDir = objDir();
if (!objDir.exists()) {
log.info("mkdir: " + objDir.getAbsolutePath());
log.info("Creating directory {}", objDir.getAbsolutePath());
objDir.mkdirs();
}
@ -551,7 +551,7 @@ public class ZiggyCppPojo {
destDir = libDir();
}
if (!destDir.exists()) {
log.info("mkdir: " + destDir.getAbsolutePath());
log.info("Creating directory {}", destDir.getAbsolutePath());
destDir.mkdirs();
}
try {

View File

@ -1,159 +0,0 @@
package gov.nasa.ziggy.buildutil;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.gradle.api.DefaultTask;
import org.gradle.api.tasks.Input;
import org.gradle.api.tasks.OutputFile;
import org.gradle.api.tasks.TaskAction;
import com.google.common.collect.ImmutableList;
/**
* Places version info in the property file {@value #BUILD_CONFIGURATION}. The available Gradle
* properties used to set the names of the various properties that describe the version include:
* <dl>
* <dt>{@code versionPropertyName}</dt>
* <dd>name of the property that holds the output of {@code git describe} (default:
* {@value #DEFAULT_BUILD_VERSION_PROPERTY_NAME})</dd>
* <dt>{@code branchPropertyName}</dt>
* <dd>name of the property that contains the name of the current branch
* (default:{@value #DEFAULT_BUILD_BRANCH_PROPERTY_NAME})</dd>
* <dt>{@code commitPropertyName}</dt>
* <dd>name of the property that contains the commit
* (default:{@value #DEFAULT_BUILD_COMMIT_PROPERTY_NAME})</dd>
* </dl>
* <p>
* The following example shows how to run this plugin only when necessary by saving the current
* Git version in a property called {@code ziggyVersion}. The default values of the aforementioned
* properties are used.
*
* <pre>
* import gov.nasa.ziggy.buildutil.ZiggyVersionGenerator
*
* def gitVersion = new ByteArrayOutputStream()
* exec {
* commandLine "git", "rev-parse", "HEAD"
* standardOutput = gitVersion
* }
* gitVersion = gitVersion.toString().trim()
*
* task ziggyVersion(type: ZiggyVersionGenerator) {
* inputs.property "ziggyVersion", gitVersion
* }
* </pre>
*
* See ZiggyConfiguration.
*/
public class ZiggyVersionGenerator extends DefaultTask {
private static final String BUILD_CONFIGURATION = "src/main/resources/ziggy-build.properties";
private static final String DEFAULT_BUILD_VERSION_PROPERTY_NAME = "ziggy.version";
private static final String DEFAULT_BUILD_BRANCH_PROPERTY_NAME = "ziggy.version.branch";
private static final String DEFAULT_BUILD_COMMIT_PROPERTY_NAME = "ziggy.version.commit";
/**
* Length that object names are abbreviated to. The number is set to 10, because that is
* currently 1 more than necessary to distinguish all commit hashes to date. One command to
* determine the maximum hash length in use is:
*
* <pre>
* $ git rev-list --all --abbrev=0 --abbrev-commit | awk '{print length()}' | sort -n | uniq -c
* 1040 4
* 7000 5
* 1149 6
* 68 7
* 8 8
* 1 9
* </pre>
*/
private static final int ABBREV = 10;
private static final String HEADER = "# This file is automatically generated by Gradle."
+ System.lineSeparator();
private String versionPropertyName = DEFAULT_BUILD_VERSION_PROPERTY_NAME;
private String branchPropertyName = DEFAULT_BUILD_BRANCH_PROPERTY_NAME;
private String commitPropertyName = DEFAULT_BUILD_COMMIT_PROPERTY_NAME;
@TaskAction
public void generateVersionProperties() throws IOException, InterruptedException {
File outputFile = new File(getProject().getProjectDir(), getBuildConfiguration());
try (BufferedWriter output = new BufferedWriter(new FileWriter(outputFile))) {
output.write(HEADER);
output.write(getVersionPropertyName() + " = "
+ runCommand(ImmutableList.of("git", "describe", "--always", "--abbrev=" + ABBREV))
+ System.lineSeparator());
output.write(getBranchPropertyName() + " = "
+ runCommand(ImmutableList.of("git", "rev-parse", "--abbrev-ref", "HEAD"))
+ System.lineSeparator());
output.write(getCommitPropertyName() + " = "
+ runCommand(ImmutableList.of("git", "rev-parse", "--short=" + ABBREV, "HEAD"))
+ System.lineSeparator());
}
}
public String runCommand(List<String> command) throws IOException, InterruptedException {
Process process = new ProcessBuilder(command).start();
List<String> output = new ArrayList<>();
BufferedReader bufferedReader = new BufferedReader(
new InputStreamReader(process.getInputStream()));
for (;;) {
String line = bufferedReader.readLine();
if (line == null) {
break;
}
output.add(line);
}
process.waitFor();
return output.stream().collect(Collectors.joining(System.lineSeparator())).trim();
}
/** Override this to create your own subclass for pipeline-side version generation. */
@OutputFile
public String getBuildConfiguration() {
return BUILD_CONFIGURATION;
}
@Input
public String getVersionPropertyName() {
return versionPropertyName;
}
public void setVersionPropertyName(String versionPropertyName) {
this.versionPropertyName = versionPropertyName;
}
@Input
public String getBranchPropertyName() {
return branchPropertyName;
}
public void setBranchPropertyName(String branchPropertyName) {
this.branchPropertyName = branchPropertyName;
}
@Input
public String getCommitPropertyName() {
return commitPropertyName;
}
public void setCommitPropertyName(String commitPropertyName) {
this.commitPropertyName = commitPropertyName;
}
}

View File

@ -1,115 +0,0 @@
defaultTasks 'build'
def getDate() {
def date = new Date()
def formattedDate = date.format('yyyyMMdd')
return formattedDate
}
task cleanestDryRun(type: Exec) {
description = "Removes pdf and .gradle directories (DRY RUN)."
outputs.upToDateWhen { false }
workingDir = rootDir
commandLine "sh", "-c", "git clean --force -x -d --dry-run"
}
task cleanest(type: Exec) {
description = "Removes pdf and .gradle directories."
outputs.upToDateWhen { false }
workingDir = rootDir
commandLine "sh", "-c", "git clean --force -x -d"
}
subprojects {
defaultTasks 'build'
task build() {
}
task makeDocId() {
description = "Generates a doc-id.sty file."
inputs.files fileTree(dir: projectDir, include: '**/*.tex', exclude: '**/build/**').files
outputs.file "$buildDir/doc-id.sty"
makeDocId.doFirst {
mkdir buildDir
}
doLast() {
if (!project.hasProperty('docId')) {
return
}
exec {
workingDir buildDir
commandLine "bash", "-c", "echo -E '\\newcommand{\\DOCID}{$docId}' > doc-id.sty"
}
}
}
task compileLaTex(dependsOn: makeDocId) {
description = "Compiles the .tex files into a .pdf file."
inputs.files fileTree(dir: projectDir, include: '**/*.tex', exclude: '**/build/**').files
outputs.files fileTree(dir: buildDir, include: '**/*.pdf').files
doFirst {
mkdir buildDir
}
doLast {
if (!project.hasProperty('texFileName')) {
return
}
// Execute twice to update references and a third time for BibTeX.
3.times {
exec {
executable 'pdflatex'
workingDir project.workingDir
args '-output-directory=build'
args '-interaction=nonstopmode'
args '-halt-on-error'
args texFileName
}
}
}
}
build.dependsOn compileLaTex
task publish(dependsOn: build) {
description = "Publishes the .pdf file into the pdf directory."
inputs.dir buildDir
outputs.files fileTree(rootDir.getPath() + '/pdf').include('**/*-' + getDate() + '.pdf').files
doFirst() {
mkdir rootDir.getPath() + '/pdf/' + publishDir
}
doLast() {
if (!project.hasProperty('texFileName') || !project.hasProperty('publishDir') || !project.hasProperty('docId')) {
return
}
copy {
from(buildDir) {
rename '^(.*).pdf$', docId + '-$1-' + getDate() + '.pdf'
}
into rootDir.getPath() + '/pdf/' + publishDir
include '**/*.pdf'
}
}
}
task clean() {
doLast() {
delete buildDir
}
}
}

View File

@ -2,7 +2,7 @@
[[Previous]](nicknames.md)
[[Up]](dusty-corners.md)
[[Next]](contact-us.md)
[[Next]](version-tracking.md)
## Advanced Unit of Work Configurations
@ -69,4 +69,4 @@ Fortunately, the `year` and `doy` parts of the `fileNameRegexp` will respect the
[[Previous]](nicknames.md)
[[Up]](dusty-corners.md)
[[Next]](contact-us.md)
[[Next]](version-tracking.md)

Binary image files not shown (five images changed; sizes before → after: 22 → 23 KiB, 22 → 22 KiB, 35 → 32 KiB, 12 → 18 KiB, 11 → 14 KiB).

View File

@ -108,7 +108,7 @@ You'll need to make several changes:
This amounts to manually importing into the database the XML files that define the pipeline, parameters, and data types. Fortunately, there are ziggy commands you can use for all of these actions:
- The command `ziggy import-parameters` allows you to read in the parameter library files.
- The command `ziggy import-types` allows you to read in the data file type definitions.
- The command `ziggy import-datastore-config` allows you to read in the data file type definitions and the definition of the datastore layout.
- The command `ziggy import-pipelines` allows you to read in the pipeline definitions.
All of the commands listed above provide help that shows the exact syntax, order of arguments, and so on. For more information on the ziggy program, take a look at [the article on running the cluster](running-pipeline.md). Most importantly: **Be sure to run the commands in the order shown above**, and specifically **be sure to run import-pipelines last!**
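As a concrete sketch, here is the sequence for the sample pipeline (the file names are illustrative; substitute your own parameter library, datastore configuration, and pipeline definition XML files):
```
# Order matters: parameters first, datastore configuration next, pipelines last.
ziggy import-parameters sample-pipeline/config/pl-sample.xml
ziggy import-datastore-config sample-pipeline/config/pt-sample.xml
ziggy import-pipelines sample-pipeline/config/pd-sample.xml
```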

View File

@ -12,7 +12,7 @@ The good news is that it's straightforward to update a pipeline definition in Zi
To see this in action, open up the Pipelines panel, select the sample pipeline, and run the `View` command from the context menu. You'll see this:
<img src="images/pipelines-config-1.png" style="width:11cm;"/>
<img src="images/pipelines-config-1.png" style="width:9cm;"/>
This shows the modules in the pipeline and their order of execution.
@ -30,7 +30,7 @@ $ ziggy update-pipelines sample-pipeline/config-extra/pd-sample.xml
**Refresh the Pipelines Panel:** Press the `Refresh` button in the pipelines panel. Again, select the sample pipeline, and run the `View` command from the context menu. You'll now see this:
<img src="images/pipelines-config-2.png" style="width:11cm;"/>
<img src="images/pipelines-config-2.png" style="width:9cm;"/>
As advertised, the averaging module has been removed from the end of the pipeline.

View File

@ -34,10 +34,6 @@ This tells Ziggy to try to pick up where it left off, in effect to force the pro
Generally the case where you'd use this is the one we're in now: some subtasks ran, others didn't, the task made it to `WAITING_TO_STORE` and then halted. Selecting this option will cause Ziggy to advance to `STORING`, at which point it will store the results from the 3 successful subtasks and abandon efforts to get results from the failed one.
#### Resume Monitoring
In somewhat unusual cases, it may happen that the monitoring subsystem will lose track of what's going on with one or more tasks. In this case, you may see signs that execution is progressing, but the console doesn't show any updates. In this case, the `Resume monitoring` option tells the monitoring subsystem to try to reconnect with the running task.
### Restarting Multiple Tasks
If you have a bunch of tasks for a given node, it may be that some will fail while others don't, and you want to restart all the failed tasks. There are two ways to do this.
@ -54,7 +50,7 @@ Why not?
If the subtask had failed because of a real problem, we would be able to fix the problem and resubmit the task, or restart from the beginning. But what actually happened is that we set a module parameter that told `permuter` to throw an exception in subtask 0.
If we re-run the task, it will re-run with the same values for all parameters (except for the remote execution parameters, but we're not using those at all for this example). This means that the `throw exception subtask 0 parameter` will still be true, and subtask 0 will fail again.
If we re-run the task, it will re-run with the same values for all parameters. This means that the `throw exception subtask 0 parameter` will still be true, and subtask 0 will fail again.
In real life, it's possible that you'll encounter a situation like this one, in which a task fails and the only way to get it to run successfully is to change the values of some module parameters. Since re-running a task doesn't let you change the parameters, you'll need to change them and use the pipelines panel to start a new pipeline instance. In the more common cases (a software bug that had to be fixed, a failure due to some sort of hardware problem, etc.), re-running a task offers the possibility of getting failed subtasks to run to completion. For example, in this case we could simulate "fixing" the problem by updating the code to ignore the `throw exception subtask 0 parameter`.

View File

@ -30,6 +30,7 @@ dump-system-properties gov.nasa.ziggy.services.config.DumpSystemProperties
execsql gov.nasa.ziggy.services.database.SqlRunner
export-parameters gov.nasa.ziggy.pipeline.xml.ParameterLibraryImportExportCli
export-pipelines gov.nasa.ziggy.pipeline.definition.PipelineDefinitionCli
generate-build-info gov.nasa.ziggy.util.BuildInfo
generate-manifest gov.nasa.ziggy.data.management.Manifest
hsqlgui org.hsqldb.util.DatabaseManagerSwing
import-datastore-config gov.nasa.ziggy.data.management.DatastoreConfigurationImporter
@ -38,11 +39,14 @@ import-parameters gov.nasa.ziggy.pipeline.xml.ParameterLibraryImportExpor
import-pipelines gov.nasa.ziggy.pipeline.definition.PipelineDefinitionCli
metrics gov.nasa.ziggy.metrics.report.MetricsCli
perf-report gov.nasa.ziggy.metrics.report.PerformanceReport
$
update-pipelines gov.nasa.ziggy.pipeline.definition.PipelineDefinitionCli
$
```
You can view more help with `ziggy --help` and even more help with `perldoc ziggy`.
Since there are a lot of commands, sub-commands, and options, we've created a bash completions file for the `ziggy` program so you can press the `TAB` key while typing a `ziggy` command to display the available commands and options. If you want to use it, run `. $ZIGGY_ROOT/etc/ziggy.bash-completion`. That's a dot at the front; it's the same mechanism that you would use to re-read your `.bashrc` file.
If you should happen to write some Java to manage your pipeline and want to use the `ziggy` program to run it, please refer to the article on [Creating Ziggy Nicknames](nicknames.md).
### Ziggy Cluster Commands
@ -104,7 +108,7 @@ That said: if your cluster initialization fails because of a problem in the XML,
If the failure was in the import of the contents of the pipeline-defining XML files, there's an alternative to using `ziggy cluster init`. Specifically, you can use other ziggy commands that import the XML files without performing initialization.
If you look at the list of ziggy nicknames in the top screen shot, there are 3 that will be helpful here: `import-parameters`, `import-types`, and `import-pipelines`. These do what they say: import the parameter library, data type definition, and pipeline definition files, respectively.
If you look at the list of ziggy nicknames in the top screen shot, there are 3 that will be helpful here: `import-parameters`, `import-datastore-config`, and `import-pipelines`. These do what they say: import the parameter library, data type definition, and pipeline definition files, respectively.
Important note: if you decide to manually import the XML files, **you must do so in the order shown above:** parameters, then data types, then the pipeline definitions. This is because some items can't import correctly unless other items that they depend upon have already been pulled in.

View File

@ -138,7 +138,9 @@ Ziggy is "A Pipeline management system for Data Analysis Pipelines." This is the
19.6. [Advanced Unit of Work Configurations](advanced-uow.md)
<!-- 19.7. [Customizing Ziggy](customizing-ziggy.md) -->
19.7. [Software Version Tracking](version-tracking.md)
<!-- 19.8. [Customizing Ziggy](customizing-ziggy.md) -->
20. [Contact Us](contact-us.md)

View File

@ -0,0 +1,36 @@
<!-- -*-visual-line-*- -->
[[Previous]](advanced-uow.md)
[[Up]](dusty-corners.md)
[[Next]](contact-us.md)
## Software Version Tracking
One of the key features of data accountability is knowledge of what version of your software (and our software!) was used to process any given task.
When Ziggy starts a pipeline task, it automatically puts version information into the task's database entry, and the version is automatically updated any time the task is restarted. This is done by reading in properties from the `ziggy-build.properties` file in Ziggy's `build/etc` directory. Here's an example of what that file looks like:
```
# This file is automatically generated by Ziggy.
# Do not edit.
ziggy.version = Ziggy-0.6.0-20240726-323-g75138d913c
ziggy.version.branch = feature/ZIGGY-430-pipeline-version
ziggy.version.commit = 75138d913c
```
The `ziggy-build.properties` file, in turn, is constructed at the end of Ziggy's build: a Java program runs a few Git commands, captures their output, and uses it to write the build properties file. Whenever a pipeline task starts or restarts, Ziggy reads `ziggy-build.properties`, obtains the value of `ziggy.version`, and stores it in the database.
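A sketch of the Git commands involved, taken from Ziggy's build tooling (the exact options, such as the 10-character hash abbreviation, are implementation details and may change):
```
# Outputs used for ziggy.version, ziggy.version.branch, and ziggy.version.commit, respectively.
git describe --always --abbrev=10
git rev-parse --abbrev-ref HEAD
git rev-parse --short=10 HEAD
```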
Now, this is all well and good, but you are undoubtedly more interested in the version of the pipeline software that was used for a given task! After all, it's the pipeline software that contains the algorithms, and the pipeline results are going to be far more sensitive to changes in the algorithms than changes in Ziggy. Fortunately, Ziggy provides a method to perform a similar capture of Git information for a repository of pipeline files.
To make this work, you need to first ensure that the `ziggy.pipeline.classpath` in the pipeline properties file includes the entry `${ziggy.home.dir}/libs/*` (for a gentle refresher on the properties files, take a look at [the article on configuring a pipeline](configuring-pipeline.md), and [the appendix article on properties files](properties.md)). Once you've done that, you can use the Ziggy command `$ZIGGY_HOME/bin/ziggy generate-build-info`, which will generate the file `pipeline-build.properties`. The resulting file will go in an `etc` subdirectory of whatever directory you set up as the pipeline's home directory (i.e., the directory that property `ziggy.pipeline.home.dir` points to). When Ziggy runs a pipeline task, it will automatically pick up the pipeline version from this file and store it in the database along with the Ziggy version.
We recommend that you include this command that generates the pipeline build file in your build system. All build systems we're familiar with have an option for a build target / task that runs a shell command; you can use that capability to run the command above.
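For instance, one of the sample pipeline's build scripts runs the command at the end of its environment setup (here `$sample_home` is the sample pipeline's home directory; substitute the directory that `ziggy.pipeline.home.dir` points to):
```
# Generate etc/pipeline-build.properties under the pipeline's home directory.
$ZIGGY_HOME/bin/ziggy generate-build-info --home $sample_home
```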
Note that this tool will only work with Git repositories. If you're using something other than Git, and would like to track your software versions, [contact us](contact-us.md) and we'll put one together for you.
Alternatively, you can write a step into your own build system that generates a `pipeline-build.properties` file when you perform your build. As described above, the output of this build step must be put into the `etc` subdirectory of the directory specified by the `ziggy.pipeline.home.dir` property. The `pipeline-build.properties` file must contain a line that defines the property `pipeline.version`; you can set this equal to any text string that makes sense, based on your version control system. You can also set the properties `pipeline.version.branch` and `pipeline.version.commit` if you have values for these that make sense, or you can leave them out.
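A minimal sketch of such a build step for a Git-based pipeline (the `PIPELINE_HOME` variable is illustrative and should point at the directory named by `ziggy.pipeline.home.dir`):
```
# Hypothetical hand-rolled build step that writes pipeline-build.properties.
mkdir -p "$PIPELINE_HOME/etc"
cat > "$PIPELINE_HOME/etc/pipeline-build.properties" <<EOF
pipeline.version = $(git describe --always)
pipeline.version.branch = $(git rev-parse --abbrev-ref HEAD)
pipeline.version.commit = $(git rev-parse --short HEAD)
EOF
```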
[[Previous]](advanced-uow.md)
[[Up]](dusty-corners.md)
[[Next]](contact-us.md)

View File

@ -45,6 +45,9 @@
<!-- Can specify an AppenderRef, but then add additivity="false" to the parameters
to avoid duplicate log messages if that appender is found in Root. -->
<Logger name="gov.nasa.ziggy.services.database" level="info"/>
<Logger name="gov.nasa.ziggy.services.logging.WriterLogOutputStream" level="info" additivity="false">
<AppenderRef ref="algorithm"/>
</Logger>
<Logger name="gov.nasa.ziggy.services.messaging" level="info"/>
<Logger name="gov.nasa.ziggy.ui" level="info"/>
<Logger name="gov.nasa.ziggy.ui.ClusterController" level="info"/>
@ -52,10 +55,6 @@
<Logger name="gov.nasa.ziggy.util.ClasspathScanner" level="info"/>
<Logger name="gov.nasa.ziggy" level="info"/>
<Logger name="org.hibernate" level="warn"/>
<Logger name="gov.nasa.ziggy.services.logging.WriterLogOutputStream"
level="info" additivity="false">
<AppenderRef ref="algorithm"/>
</Logger>
<!--
To view the stack trace whenever DatabaseTransactionFactory.performTransaction is called, uncomment the following.
@ -63,7 +62,8 @@
-->
<!--
To view SQL, uncomment the following and add the following VM arguments to your run configuration in Eclipse.
To view SQL, uncomment the following, set the console's ThresholdFilter level to "DEBUG",
and add the following VM arguments to your run configuration in Eclipse.
-Dlog4j2.configurationFile=etc/log4j2.xml -Dhibernate.show_sql=true -Dhibernate.format_sql=true -Dhibernate.use_sql_comments=true
<Logger name="org.hibernate.SQL" level="debug"/>
-->

etc/ziggy.bash-completion (new file, 128 lines)
View File

@ -0,0 +1,128 @@
# ziggy completion -*- shell-script -*-
#
# To add Bash completion for the ziggy command, run ". $ZIGGY_ROOT/etc/ziggy.bash-completion".
_ziggy() {
local cur prev commands options
cur="${COMP_WORDS[COMP_CWORD]}"
prev="${COMP_WORDS[COMP_CWORD-1]}"
# Complete Ziggy nicknames and their commands or options.
case $prev in
ziggy)
nicknames=$(ziggy | tail -n +2 | awk '{print $1}')
COMPREPLY=($(compgen -W "${nicknames}" -- ${cur}))
return;;
cluster)
commands="init start stop status console version -h --help"
COMPREPLY=($(compgen -W "${commands}" -- ${cur}))
return;;
compute-node-master)
return;;
console)
if [ ${COMP_CWORD} -eq 2 ]; then
commands="config display halt log restart start version"
COMPREPLY=($(compgen -W "${commands}" -- ${cur}))
fi
return;;
dump-system-properties)
return;;
execsql)
return;;
export-parameters | export-pipelines | import-parameters | import-pipelines)
options="-dryrun -nodb"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
generate-manifest)
return;;
hsqlgui)
options="--help --driver --url --user --password --urlid --rcfile --dir --script --noexit"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
import-datastore-config)
return;;
import-events)
return;;
metrics)
commands="available dump report"
COMPREPLY=($(compgen -W "${commands}" -- ${cur}))
return;;
perf-report)
options="-force -id -nodes -taskdir"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
update-pipelines)
options="-dryrun"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
esac
# Complete sub-command options.
case "${COMP_WORDS[1]}" in
cluster)
case "${COMP_WORDS[2]}" in
init)
options="-f --force"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
start)
options="console --workerCount --workerHeapSize"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
esac;;
console)
case "${prev}" in
--configType)
options="data-model-registry instance pipeline pipeline-nodes"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
--displayType)
options="alerts errors full statistics statistics-detailed"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
--restartMode)
options="restart-from-beginning resume-current-step resubmit"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
esac
case "${COMP_WORDS[2]}" in
config)
options="--configType --instance --pipeline"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
display)
options="--displayType --instance --task"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
halt)
options="--instance --task"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
log)
options="--task --errors"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
restart)
options="--restartMode --instance --task"
COMPREPLY=($(compgen -W "${options}" -- ${cur}))
return;;
esac;;
esac
}
complete -F _ziggy ziggy

View File

@ -18,6 +18,7 @@ ziggy.nickname.dump-system-properties = gov.nasa.ziggy.services.config.DumpSyste
ziggy.nickname.execsql = gov.nasa.ziggy.services.database.SqlRunner|||
ziggy.nickname.export-parameters = gov.nasa.ziggy.pipeline.xml.ParameterLibraryImportExportCli|||-export
ziggy.nickname.export-pipelines = gov.nasa.ziggy.pipeline.definition.PipelineDefinitionCli|||-export
ziggy.nickname.generate-build-info = gov.nasa.ziggy.util.BuildInfo|||
ziggy.nickname.generate-manifest = gov.nasa.ziggy.data.management.Manifest|||
ziggy.nickname.hsqlgui = org.hsqldb.util.DatabaseManagerSwing|||
ziggy.nickname.import-datastore-config = gov.nasa.ziggy.data.management.DatastoreConfigurationImporter|||

View File

@ -7,7 +7,7 @@ org.gradle.parallel = true
// The version is updated when the first release candidate is created
// while following Release Branches in Appendix C of the SMP, Git
// Workflow. This property is used when publishing Ziggy.
version = 0.6.0
version = 0.7.0
// The Maven group for the published Ziggy libraries.
group = gov.nasa

View File

@ -45,7 +45,7 @@ source $python_env/bin/activate
pip3 install h5py Pillow numpy
# Get the location of the environment's site packages directory
site_pkgs=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())")
site_pkgs=$(python3 -c "from sysconfig import get_path; print(get_path('purelib'))")
# Copy the pipeline major_tom package to the site-packages location.
cp -r $ziggy_root/sample-pipeline/src/main/python/major_tom $site_pkgs
@ -54,4 +54,7 @@ cp -r $ziggy_root/sample-pipeline/src/main/python/major_tom $site_pkgs
cp -r $ziggy_root/src/main/python/hdf5mi $site_pkgs
cp -r $ziggy_root/src/main/python/zigutils $site_pkgs
# Generate version information.
$ZIGGY_HOME/bin/ziggy generate-build-info --home $sample_home
exit 0

View File

@ -39,7 +39,7 @@ trap 'deactivate' EXIT
source $SAMPLE_PIPELINE_PYTHON_ENV/bin/activate
# Get the location of the environment's site packages directory
SITE_PKGS=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())")
SITE_PKGS=$(python3 -c "from sysconfig import get_path; print(get_path('purelib'))")
# Use the environment's Python to run the image-averaging Python script
python3 $SITE_PKGS/major_tom/averaging.py

View File

@ -39,7 +39,7 @@ trap 'deactivate' EXIT
source $SAMPLE_PIPELINE_PYTHON_ENV/bin/activate
# Get the location of the environment's site packages directory
SITE_PKGS=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())")
SITE_PKGS=$(python3 -c "from sysconfig import get_path; print(get_path('purelib'))")
# Use the environment's Python to run the flipper Python script
python3 $SITE_PKGS/major_tom/flipper.py

View File

@ -39,7 +39,7 @@ trap 'deactivate' EXIT
source $SAMPLE_PIPELINE_PYTHON_ENV/bin/activate
# Get the location of the environment's site packages directory.
SITE_PKGS=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())")
SITE_PKGS=$(python3 -c "from sysconfig import get_path; print(get_path('purelib'))")
# Use the environment's Python to run the permuter Python script.
python3 $SITE_PKGS/major_tom/permuter.py

View File

@ -13,9 +13,9 @@ task generateHsqldbCreateScript(type: JavaExec, dependsOn: copyLibs) {
mainClass.set("gov.nasa.ziggy.services.database.ZiggySchemaExport")
classpath fileTree(dir: "$buildDir/libs", include: "*.jar")
jvmArgs "-Dhibernate.dialect=org.hibernate.dialect.HSQLDialect",
"-Djava.library.path=$outsideDir/lib",
"-Dhibernate.connection.url=jdbc:hsqldb:mem:test"
jvmArgs "-Dhibernate.connection.url=jdbc:hsqldb:mem:test",
"-Dhibernate.dialect=org.hibernate.dialect.HSQLDialect",
"-Djava.library.path=$outsideDir/lib"
args "--create", "--output=$buildDir/schema/ddl.hsqldb-create.sql"
logging.captureStandardOutput LogLevel.INFO
@ -23,6 +23,7 @@ task generateHsqldbCreateScript(type: JavaExec, dependsOn: copyLibs) {
}
test.dependsOn generateHsqldbCreateScript
integrationTest.dependsOn generateHsqldbCreateScript
assemble.dependsOn generateHsqldbCreateScript
task generateHsqldbDropScript(type: JavaExec, dependsOn: copyLibs) {
@ -32,9 +33,9 @@ task generateHsqldbDropScript(type: JavaExec, dependsOn: copyLibs) {
mainClass.set("gov.nasa.ziggy.services.database.ZiggySchemaExport")
classpath fileTree(dir: "$buildDir/libs", include: "*.jar")
jvmArgs "-Dhibernate.dialect=org.hibernate.dialect.HSQLDialect",
"-Djava.library.path=$outsideDir/lib",
"-Dhibernate.connection.url=jdbc:hsqldb:mem:test"
jvmArgs "-Dhibernate.connection.url=jdbc:hsqldb:mem:test",
"-Dhibernate.dialect=org.hibernate.dialect.HSQLDialect",
"-Djava.library.path=$outsideDir/lib"
args "--drop", "--output=$buildDir/schema/ddl.hsqldb-drop.sql"
logging.captureStandardOutput LogLevel.INFO
@ -42,6 +43,7 @@ task generateHsqldbDropScript(type: JavaExec, dependsOn: copyLibs) {
}
test.dependsOn generateHsqldbDropScript
integrationTest.dependsOn generateHsqldbDropScript
assemble.dependsOn generateHsqldbDropScript
task generatePostgresqlCreateScript(type: JavaExec, dependsOn: copyLibs) {
@ -51,9 +53,9 @@ task generatePostgresqlCreateScript(type: JavaExec, dependsOn: copyLibs) {
mainClass.set("gov.nasa.ziggy.services.database.ZiggySchemaExport")
classpath fileTree(dir: "$buildDir/libs", include: "*.jar")
jvmArgs "-Dhibernate.dialect=org.hibernate.dialect.PostgreSQLDialect",
"-Djava.library.path=$outsideDir/lib",
"-Dhibernate.connection.url=jdbc:hsqldb:mem:test"
jvmArgs "-Dhibernate.connection.url=jdbc:hsqldb:mem:test",
"-Dhibernate.dialect=org.hibernate.dialect.PostgreSQLDialect",
"-Djava.library.path=$outsideDir/lib"
args "--create", "--output=$buildDir/schema/ddl.postgresql-create.sql"
logging.captureStandardOutput LogLevel.INFO
@ -69,9 +71,9 @@ task generatePostgresqlDropScript(type: JavaExec, dependsOn: copyLibs) {
mainClass.set("gov.nasa.ziggy.services.database.ZiggySchemaExport")
classpath fileTree(dir: "$buildDir/libs", include: "*.jar")
jvmArgs "-Dhibernate.dialect=org.hibernate.dialect.PostgreSQLDialect",
"-Djava.library.path=$outsideDir/lib",
"-Dhibernate.connection.url=jdbc:hsqldb:mem:test"
jvmArgs "-Dhibernate.connection.url=jdbc:hsqldb:mem:test",
"-Dhibernate.dialect=org.hibernate.dialect.PostgreSQLDialect",
"-Djava.library.path=$outsideDir/lib"
args "--drop", "--output=$buildDir/schema/ddl.postgresql-drop.sql"
logging.captureStandardOutput LogLevel.INFO

View File

@ -1,9 +1,6 @@
// Show all dependencies.
task allDeps(type: DependencyReportTask) {}
// Generate the Ziggy version information.
import gov.nasa.ziggy.buildutil.ZiggyVersionGenerator
def gitVersion = new ByteArrayOutputStream()
exec {
commandLine "git", "rev-parse", "HEAD"
@ -11,14 +8,16 @@ exec {
}
gitVersion = gitVersion.toString().trim()
task ziggyVersion(type: ZiggyVersionGenerator) {
task ziggyVersion(type: JavaExec) {
inputs.property "ziggyVersion", gitVersion
outputs.file "$buildDir/etc/ziggy-build.properties"
classpath = sourceSets.main.runtimeClasspath
mainClass = "gov.nasa.ziggy.util.BuildInfo"
args = ["--prefix", "ziggy", "--home", "$buildDir"]
}
processResources.dependsOn ziggyVersion
compileTestJava.dependsOn processResources
integrationTest.dependsOn processResources
sourcesJar.dependsOn processResources
integrationTest.dependsOn ziggyVersion
assemble.dependsOn ziggyVersion
clean.doFirst() {
File supervisorPidFile = new File("$buildDir/bin/supervisor.pid");

View File

@ -1,3 +1,73 @@
test {
systemProperty "java.library.path", "$outsideDir/lib"
maxHeapSize = "1024m"
testLogging {
events "failed", "skipped"
}
useJUnit {
// If a code coverage report that includes the integration tests is desired, then comment
// out the IntegrationTestCategory line and uncomment the RunByNameTestCategory line. When
// the JaCoCo issue described below is resolved, then delete this comment.
// excludeCategories 'gov.nasa.ziggy.RunByNameTestCategory'
excludeCategories 'gov.nasa.ziggy.IntegrationTestCategory'
}
// Use "gradle -P traceTests test" to show test order.
if (project.hasProperty("traceTests")) {
afterTest { desc, result ->
logger.quiet "${desc.className}.${desc.name}: ${result.resultType}"
}
}
}
// Execute tests marked with @Category(IntegrationTestCategory.class).
task integrationTest(type: Test) {
systemProperty "log4j2.configurationFile", "$rootDir/etc/log4j2.xml"
systemProperty "ziggy.logfile", "$buildDir/build.log"
systemProperty "java.library.path", "$outsideDir/lib"
testLogging {
events "failed", "skipped"
}
useJUnit {
includeCategories 'gov.nasa.ziggy.IntegrationTestCategory'
excludeCategories 'gov.nasa.ziggy.FlakyTestCategory'
excludeCategories 'gov.nasa.ziggy.RunByNameTestCategory'
}
}
check.dependsOn integrationTest
cleanTest.dependsOn cleanIntegrationTest
// Execute tests marked with @Category(FlakyTestCategory.class).
task flakyTest(type: Test) {
systemProperty "log4j2.configurationFile", "$rootDir/etc/log4j2.xml"
systemProperty "ziggy.logfile", "$buildDir/build.log"
systemProperty "java.library.path", "$outsideDir/lib"
useJUnit {
includeCategories 'gov.nasa.ziggy.FlakyTestCategory'
}
}
check.dependsOn flakyTest
cleanTest.dependsOn cleanFlakyTest
// Execute tests marked with @Category(RunByNameTestCategory.class).
// These tests are typically run explicitly with the --tests option
// since they don't play well with others. For example:
// gradle runByNameTests --tests *RmiInterProcessCommunicationTest
task runByNameTest(type: Test) {
systemProperty "log4j2.configurationFile", "$rootDir/etc/log4j2.xml"
systemProperty "ziggy.logfile", "$buildDir/build.log"
systemProperty "java.library.path", "$outsideDir/lib"
useJUnit {
includeCategories 'gov.nasa.ziggy.RunByNameTestCategory'
}
}
/**
* testprog is used to test external process control. testprog has the
* advantage over something like /bin/true that it can run for a specified

View File

@ -1,5 +1,5 @@
// Generate XML schemas.
task generateXmlSchemas(type: JavaExec, dependsOn: [copyLibs, processResources]) {
task generateXmlSchemas(type: JavaExec, dependsOn: copyLibs) {
outputs.dir "$projectDir/build/schema/xml"
mainClass.set("gov.nasa.ziggy.pipeline.xml.XmlSchemaExporter")

View File

@ -1,52 +0,0 @@
package gov.nasa.ziggy.collections;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
/**
* Given a list, returns successive sub-lists which encompass all of the elements of the list. Each
* sub-list is limited to some maximum size.
*
* @author Sean McCauliff
*/
public class ListChunkIterator<T> implements Iterator<List<T>>, Iterable<List<T>> {
private final Iterator<T> source;
private final int chunkSize;
public ListChunkIterator(Iterator<T> source, int chunkSize) {
if (source == null) {
throw new NullPointerException("source");
}
this.source = source;
this.chunkSize = chunkSize;
}
public ListChunkIterator(Iterable<T> source, int chunkSize) {
this(source.iterator(), chunkSize);
}
@Override
public boolean hasNext() {
return source.hasNext();
}
@Override
public List<T> next() {
List<T> rv = new ArrayList<>(chunkSize);
for (int i = 0; i < chunkSize && source.hasNext(); i++) {
rv.add(source.next());
}
return rv;
}
@Override
public void remove() {
throw new UnsupportedOperationException("Operation not supported.");
}
@Override
public Iterator<List<T>> iterator() {
return this;
}
}

View File

@ -145,6 +145,13 @@ public class ZiggyArrayUtils {
String componentType = loopObject.getClass().getComponentType().getName();
if (componentType.startsWith("[")) {
Object[] arrayObject = (Object[]) loopObject;
// Handle a zero-length array.
if (arrayObject == null || arrayObject.length == 0) {
arrayDimensionList.add(0L);
isArray = false;
continue;
}
loopObject = arrayObject[0];
isArray = true;
} else {

View File

@ -1,13 +1,20 @@
package gov.nasa.ziggy.crud;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import org.hibernate.Session;
import org.hibernate.query.Query;
import org.hibernate.query.criteria.HibernateCriteriaBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.collections.ListChunkIterator;
import com.google.common.collect.Lists;
import gov.nasa.ziggy.services.database.DatabaseController;
import gov.nasa.ziggy.services.database.DatabaseService;
import jakarta.persistence.LockModeType;
import jakarta.persistence.criteria.CriteriaBuilder;
@ -26,14 +33,7 @@ import jakarta.persistence.criteria.CriteriaUpdate;
*/
public abstract class AbstractCrud<U> implements AbstractCrudInterface<U> {
/**
* This is the maximum number of dynamically-created expressions sent to the database. This
* limit is 1000 in Oracle. A setting of 950 leaves plenty of room for other expressions in the
* query.
*
* @see ListChunkIterator
*/
public static final int MAX_EXPRESSIONS = 950;
private static final Logger log = LoggerFactory.getLogger(AbstractCrud.class);
private DatabaseService databaseService;
@ -186,6 +186,48 @@ public abstract class AbstractCrud<U> implements AbstractCrudInterface<U> {
.getResultList();
}
/**
* Performs a query in which the query must be broken into multiple discrete queries due to
* database query language limitations, the results of which are then combined and returned.
* <p>
* For example:
*
* <pre>
* chunkedQuery(pipelineTaskIds,
* chunk -> list(createZiggyQuery(PipelineTask.class).column(PipelineTask_.id)
* .ascendingOrder()
* .in(chunk)));
* </pre>
*
* The source argument is the collection of objects of type T that constrain the query, and
* query is a function that applies those constraints and returns the corresponding results.
*
* @param <T> class of the objects in the list that constrain the query
* @param <R> class of the objects in the list returned by the query
* @param source the list of elements of type T to use in the query
* @param query returns a list of results of type R based upon the collection of type T
* @return list of type R
*/
protected <T, R> List<R> chunkedQuery(List<T> source, Function<List<T>, List<R>> query) {
if (source.isEmpty()) {
return Collections.emptyList();
}
int maxExpressions = maxExpressions();
List<R> results = new ArrayList<>(maxExpressions * 2);
for (List<T> chunk : Lists.partition(source, maxExpressions)) {
log.info("Created chunk of size {}", chunk.size());
results.addAll(query.apply(chunk));
}
return results;
}
/**
* Maximum expressions allowed in a query.
*/
int maxExpressions() {
return DatabaseController.newInstance().maxExpressions();
}
/**
* Flush any changes to persistent objects to the underlying database.
*/

View File

@ -22,7 +22,7 @@ public class ProtectedEntityInterceptor implements Interceptor {
private static final List<String> allowedPrefixes = new CopyOnWriteArrayList<>();
public static void addAllowedPrefix(String prefix) {
log.info("Adding allowed prefix for flushed classes: " + prefix);
log.info("Adding allowed prefix {} for flushed classes", prefix);
allowedPrefixes.add(prefix);
}

View File

@ -7,11 +7,9 @@ import java.util.Collection;
import java.util.List;
import java.util.Set;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections4.CollectionUtils;
import org.hibernate.query.criteria.HibernateCriteriaBuilder;
import com.google.common.collect.Lists;
import gov.nasa.ziggy.module.PipelineException;
import jakarta.persistence.criteria.AbstractQuery;
import jakarta.persistence.criteria.CriteriaBuilder;
@ -79,9 +77,6 @@ public class ZiggyQuery<T, R> {
// AbstractQuery allows this to be either CriteriaQuery or Subquery, as needed.
private AbstractQuery<R> jpaQuery;
/** For testing only. */
private List<List<Object>> queryChunks = new ArrayList<>();
/** Constructor for {@link CriteriaQuery} instances. */
ZiggyQuery(Class<T> databaseClass, Class<R> returnClass, AbstractCrud<?> crud) {
builder = crud.createCriteriaBuilder();
@ -244,27 +239,6 @@ public class ZiggyQuery<T, R> {
return this;
}
/**
* Performs the action of the {@link #in(Collection)} method, but performs the query in chunks.
* This allows queries in which the collection of values is too large for a single query. The
* number of values in a chunk is set by the {@link AbstractCrud#MAX_EXPRESSIONS} constant.
*/
@SuppressWarnings("unchecked")
public <Y> ZiggyQuery<T, R> chunkedIn(Collection<Y> values) {
checkState(hasScalarAttribute(), "a column has not been defined");
List<Y> valuesList = new ArrayList<>(values);
Predicate criterion = builder.disjunction();
for (List<Y> valuesSubset : Lists.partition(valuesList, maxExpressions())) {
queryChunks.add((List<Object>) valuesSubset);
criterion = builder.or(criterion,
attribute != null
? builder.in((Path<? extends Y>) root.get(attribute), valuesSubset)
: builder.in(root.get(columnName), valuesSubset));
}
predicates.add(criterion);
return this;
}
public <Y> CriteriaBuilder.In<Y> in(Expression<? extends Y> expression, Collection<Y> values) {
return builder.in(expression, values);
}
@ -579,18 +553,4 @@ public class ZiggyQuery<T, R> {
}
return (CriteriaQuery<R>) jpaQuery;
}
/**
* Maximum expressions allowed in each chunk of {@link #chunkedIn(Collection)}. Broken out into
* a package-private method so that tests can reduce this value to something small enough to
* exercise in test.
*/
int maxExpressions() {
return AbstractCrud.MAX_EXPRESSIONS;
}
/** For testing only. */
List<List<Object>> queryChunks() {
return queryChunks;
}
}

View File

@ -1,20 +0,0 @@
package gov.nasa.ziggy.data.accounting;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
/**
* Writes out a task using PipelineTask.prettyPrint()
*
* @author Sean McCauliff
*/
public class DetailedPipelineTaskRenderer implements PipelineTaskRenderer {
@Override
public String renderTask(PipelineTask task) {
return task.prettyPrint();
}
@Override
public String renderDefaultTask() {
return "Data Receipt";
}
}

View File

@ -10,7 +10,7 @@ import gov.nasa.ziggy.pipeline.definition.PipelineTask;
public class SimpleTaskRenderer implements PipelineTaskRenderer {
@Override
public String renderTask(PipelineTask task) {
return task.uowTaskInstance().briefState();
return task.getUnitOfWork().briefState();
}
@Override

View File

@ -51,7 +51,7 @@ import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections4.CollectionUtils;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -533,8 +533,9 @@ public class DatastoreConfigurationImporter {
List<String> databaseDataFileTypeNames = datastoreOperations().dataFileTypeNames();
for (DataFileType typeXb : importedDataFileTypes) {
if (databaseDataFileTypeNames.contains(typeXb.getName())) {
log.warn("Not importing data file type definition \"{}\""
+ " due to presence of existing type with same name", typeXb.getName());
log.warn(
"Not importing data file type definition {} due to presence of existing type with same name",
typeXb.getName());
dataFileTypesNotImported.add(typeXb);
continue;
}
@ -562,14 +563,13 @@ public class DatastoreConfigurationImporter {
try {
modelTypeXb.validate();
} catch (Exception e) {
log.warn("Unable to validate model type definition " + modelTypeXb.getType(), e);
log.warn("Unable to validate model type definition {}", modelTypeXb.getType(), e);
modelTypesNotImported.add(modelTypeXb);
continue;
}
if (databaseModelTypes.contains(modelTypeXb.getType())) {
log.warn(
"Not importing model type definition \"{}\""
+ " due to presence of existing type with same name",
"Not importing model type definition {} due to presence of existing type with same name",
modelTypeXb.getType());
modelTypesNotImported.add(modelTypeXb);
continue;

View File

@ -39,10 +39,10 @@ import gov.nasa.ziggy.pipeline.definition.ModelRegistry;
import gov.nasa.ziggy.pipeline.definition.ModelType;
import gov.nasa.ziggy.pipeline.definition.PipelineDefinitionNode;
import gov.nasa.ziggy.pipeline.definition.PipelineDefinitionProcessingOptions.ProcessingMode;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.database.PipelineDefinitionNodeOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineDefinitionOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.services.alert.AlertService;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.uow.DirectoryUnitOfWorkGenerator;
@ -128,7 +128,7 @@ public class DatastoreFileManager {
List<DataFileType> allFilesAllSubtasksDataFileTypes = new ArrayList<>(dataFileTypes);
allFilesAllSubtasksDataFileTypes.removeAll(filePerSubtaskDataFileTypes);
UnitOfWork uow = pipelineTask.uowTaskInstance();
UnitOfWork uow = pipelineTask.getUnitOfWork();
// Generate sets of DataFilesForDataFileType instances. These provide the necessary
// information for mapping files in the datastore into the files needed by each
@ -248,13 +248,12 @@ public class DatastoreFileManager {
.collect(Collectors.toSet());
// Find the consumers that correspond to the definition node of the current task.
List<Long> consumersWithMatchingPipelineNode = pipelineTaskOperations()
.taskIdsForPipelineDefinitionNode(pipelineTask);
List<PipelineTask> consumersWithMatchingPipelineNode = pipelineTaskOperations()
.tasksForPipelineDefinitionNode(pipelineTask);
// Obtain the Set of datastore files that are in the relativizedFilePaths collection
// and which have a consumer that matches the pipeline definition node of the
// current
// pipeline task.
// current pipeline task.
Set<String> namesOfFilesAlreadyProcessed = datastoreProducerConsumerOperations()
.filesConsumedByTasks(consumersWithMatchingPipelineNode, relativizedFilePaths);

View File

@ -18,7 +18,7 @@ import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections4.CollectionUtils;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;


@ -22,8 +22,8 @@ import gov.nasa.ziggy.module.PipelineException;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.xml.HasXmlSchemaFilename;
import gov.nasa.ziggy.pipeline.xml.ValidatingXmlManager;
import gov.nasa.ziggy.services.alert.Alert.Severity;
import gov.nasa.ziggy.services.alert.AlertService;
import gov.nasa.ziggy.services.alert.AlertService.Severity;
import gov.nasa.ziggy.services.config.PropertyName;
import gov.nasa.ziggy.services.config.ZiggyConfiguration;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
@ -141,8 +141,8 @@ public class Acknowledgement implements HasXmlSchemaFilename {
if (!Files.exists(acknowledgementPath)) {
return false;
}
ValidatingXmlManager<Acknowledgement> xmlManager;
xmlManager = new ValidatingXmlManager<>(Acknowledgement.class);
ValidatingXmlManager<Acknowledgement> xmlManager = new ValidatingXmlManager<>(
Acknowledgement.class);
Acknowledgement ack = xmlManager.unmarshal(acknowledgementPath.toFile());
return ack.transferStatus.equals(DataReceiptStatus.VALID);
}
@ -171,12 +171,12 @@ public class Acknowledgement implements HasXmlSchemaFilename {
*
* @param manifest {@link Manifest} file to be acknowledged.
* @param dir Location of the files referenced in the manifest.
* @param taskId ID of the {@link PipelineTask} that is performing the manifest validation.
* @param pipelineTask {@link PipelineTask} that is performing the manifest validation.
* @return {@link Acknowledgement} that includes validation status of all files referenced in
* the manifest.
*/
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public static Acknowledgement of(Manifest manifest, Path dir, long taskId) {
public static Acknowledgement of(Manifest manifest, Path dir, PipelineTask pipelineTask) {
Acknowledgement acknowledgement = new Acknowledgement(manifest.getChecksumType());
acknowledgement.setDatasetId(manifest.getDatasetId());
@ -211,8 +211,8 @@ public class Acknowledgement implements HasXmlSchemaFilename {
// failures.
if (validationFailures == 0) {
AlertService.getInstance()
.generateAndBroadcastAlert("Data Receipt", taskId, Severity.WARNING,
"File validation errors encountered");
.generateAndBroadcastAlert("Data Receipt", pipelineTask,
Severity.WARNING, "File validation errors encountered");
}
validationFailures++;
@ -221,7 +221,7 @@ public class Acknowledgement implements HasXmlSchemaFilename {
// which files failed prior to termination.
if (validationFailures >= maxValidationFailures) {
AlertService.getInstance()
.generateAndBroadcastAlert("Data Receipt", taskId, Severity.ERROR,
.generateAndBroadcastAlert("Data Receipt", pipelineTask, Severity.ERROR,
"Exceeded " + maxValidationFailures + ", terminating");
break;
}
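The ZIGGY-433 PipelineTask refactoring threads the task object itself, rather than a numeric task ID, through the acknowledgement and alert APIs above. A sketch of a caller under the new signatures; manifest, dataImportPath, and pipelineTask are assumed to be supplied by the enclosing data receipt code:

Acknowledgement acknowledgement = Acknowledgement.of(manifest, dataImportPath, pipelineTask);
acknowledgement.write(dataImportPath);
List<String> validFiles = acknowledgement.namesOfValidFiles();

// Alerts take the task object and the relocated Alert.Severity enum.
AlertService.getInstance()
    .generateAndBroadcastAlert("Data Receipt", pipelineTask, Severity.WARNING,
        "File validation errors encountered");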


@ -39,7 +39,7 @@ public class DataReceiptOperations extends DatabaseOperations {
for (PipelineInstance pipelineInstance : pipelineInstances) {
DataReceiptInstance dataReceiptInstance = new DataReceiptInstance();
dataReceiptInstance.setInstanceId(pipelineInstance.getId());
dataReceiptInstance.setDate(pipelineInstance.getStartProcessingTime());
dataReceiptInstance.setDate(pipelineInstance.getCreated());
dataReceiptInstance.setFailedImportCount(
failedImportCrud().retrieveCountForInstance(pipelineInstance.getId()));


@ -27,8 +27,8 @@ import gov.nasa.ziggy.module.PipelineException;
import gov.nasa.ziggy.pipeline.definition.PipelineModule;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.ProcessingStep;
import gov.nasa.ziggy.services.alert.Alert.Severity;
import gov.nasa.ziggy.services.alert.AlertService;
import gov.nasa.ziggy.services.alert.AlertService.Severity;
import gov.nasa.ziggy.services.config.PropertyName;
import gov.nasa.ziggy.services.config.ZiggyConfiguration;
import gov.nasa.ziggy.uow.DataReceiptUnitOfWorkGenerator;
@ -86,7 +86,7 @@ public class DataReceiptPipelineModule extends PipelineModule {
// Get the top-level DR directory and the datastore root directory
ImmutableConfiguration config = ZiggyConfiguration.getInstance();
dataReceiptDir = config.getString(PropertyName.DATA_RECEIPT_DIR.property());
UnitOfWork uow = pipelineTask.uowTaskInstance();
UnitOfWork uow = pipelineTask.getUnitOfWork();
dataReceiptTopLevelPath = Paths.get(dataReceiptDir).toAbsolutePath();
dataImportPathForTask = dataImportPathForTask(uow);
}
@ -111,9 +111,9 @@ public class DataReceiptPipelineModule extends PipelineModule {
}
if (!containsNonHiddenFiles) {
log.warn("Directory " + dataImportPathForTask.toString()
+ " contains no files, skipping DR");
alertService().generateAndBroadcastAlert("DR", pipelineTask.getId(), Severity.WARNING,
log.warn("Directory {} contains no files, skipping DR",
dataImportPathForTask.toString());
alertService().generateAndBroadcastAlert("DR", pipelineTask, Severity.WARNING,
"Directory " + dataImportPathForTask.toString() + " contains no files");
return true;
}


@ -13,7 +13,7 @@ import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections4.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -24,8 +24,8 @@ import gov.nasa.ziggy.models.ModelImporter;
import gov.nasa.ziggy.models.ModelOperations;
import gov.nasa.ziggy.pipeline.definition.ModelType;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.services.alert.Alert.Severity;
import gov.nasa.ziggy.services.alert.AlertService;
import gov.nasa.ziggy.services.alert.AlertService.Severity;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
@ -132,7 +132,7 @@ public class DatastoreDirectoryDataReceiptDefinition implements DataReceiptDefin
/** Generates acknowledgement and returns true if transfer status is VALID. */
private boolean acknowledgeManifest() {
acknowledgement = Acknowledgement.of(manifest, dataImportPath, pipelineTask.getId());
acknowledgement = Acknowledgement.of(manifest, dataImportPath, pipelineTask);
// Write the acknowledgement to the directory.
acknowledgement.write(dataImportPath);
@ -175,7 +175,8 @@ public class DatastoreDirectoryDataReceiptDefinition implements DataReceiptDefin
// be the set of all names in the manifest)
List<String> namesOfValidFiles = acknowledgement.namesOfValidFiles();
Map<Path, Path> regularFilesInDirTree = ZiggyFileUtils.regularFilesInDirTree(dataImportPath);
Map<Path, Path> regularFilesInDirTree = ZiggyFileUtils
.regularFilesInDirTree(dataImportPath);
List<String> filenamesInDirTree = regularFilesInDirTree.keySet()
.stream()
.map(Path::toString)
@ -283,7 +284,7 @@ public class DatastoreDirectoryDataReceiptDefinition implements DataReceiptDefin
move(file, destinationFile);
successfulImports.add(datastoreRoot.relativize(destinationFile));
} catch (IOException e) {
log.error("Failed to import file " + file.toString(), e);
log.error("Failed to import file {}", file.toString(), e);
exceptionCount++;
failedImports.add(datastoreRoot.relativize(destinationFile));
}
@ -333,8 +334,8 @@ public class DatastoreDirectoryDataReceiptDefinition implements DataReceiptDefin
failedImports.addAll(importer.getFailedImports());
log.warn("{} out of {} model files failed to import", importer.getFailedImports(),
modelFilesForImport.size());
alertService().generateAndBroadcastAlert("Data Receipt (DR)", pipelineTask.getId(),
AlertService.Severity.WARNING, "Failed to import " + importer.getFailedImports()
alertService().generateAndBroadcastAlert("Data Receipt (DR)", pipelineTask,
Severity.WARNING, "Failed to import " + importer.getFailedImports()
+ " model files (out of " + modelFilesForImport.size() + ")");
}
}


@ -33,6 +33,9 @@ import jakarta.persistence.Table;
* {@link ModelRegistry} of the current versions of all models that is provided to a
* {@link PipelineInstance} when the instance is created, and which can be exposed by the instance
* report.
* <p>
* A non-producing consumer is a consumer that failed to produce results from processing. It is
* indicated by the negative of the task ID.
*
* @author PT
*/
@ -61,14 +64,22 @@ public class DatastoreProducerConsumer {
public DatastoreProducerConsumer() {
}
public DatastoreProducerConsumer(long producerId, String filename) {
public DatastoreProducerConsumer(PipelineTask producerPipelineTask, Path datastoreFile) {
this(producerPipelineTask, datastoreFile.toString());
}
public DatastoreProducerConsumer(PipelineTask producerPipelineTask, String filename) {
this(toProducerId(producerPipelineTask), filename);
}
private DatastoreProducerConsumer(long producerId, String filename) {
checkNotNull(filename, "filename");
this.filename = filename;
this.producerId = producerId;
}
public DatastoreProducerConsumer(PipelineTask pipelineTask, Path datastoreFile) {
this(pipelineTask.getId(), datastoreFile.toString());
private static long toProducerId(PipelineTask producerPipelineTask) {
return producerPipelineTask != null ? producerPipelineTask.getId() : 0;
}
public void setFilename(String filename) {
@ -83,8 +94,8 @@ public class DatastoreProducerConsumer {
return producerId;
}
public void setProducer(long producer) {
producerId = producer;
public void setProducer(PipelineTask producer) {
producerId = toProducerId(producer);
}
public Set<Long> getConsumers() {
@ -97,11 +108,19 @@ public class DatastoreProducerConsumer {
return consumers.stream().map(Math::abs).collect(Collectors.toSet());
}
public void setConsumers(Set<Long> consumers) {
this.consumers.addAll(consumers);
public void addConsumer(PipelineTask consumingPipelineTask) {
addConsumer(toProducerId(consumingPipelineTask));
}
public void addConsumer(long consumer) {
/**
* Adds the given consumer to this object. A non-producing consumer is a consumer that failed to
* produce results from processing.
*/
public void addNonProducingConsumer(PipelineTask consumingPipelineTask) {
addConsumer(-toProducerId(consumingPipelineTask));
}
private void addConsumer(long consumer) {
consumers.add(consumer);
}
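With the reworked constructors, DatastoreProducerConsumer bookkeeping is expressed directly in terms of PipelineTask objects: a null producer is stored as ID 0, and a failed ("non-producing") consumer as the negative of its task ID. A small usage sketch; the tasks and the datastore path are placeholders:

// Record which task produced a datastore file.
DatastoreProducerConsumer producerConsumer = new DatastoreProducerConsumer(producerTask,
    Paths.get("sector-1/cal/outputs.h5"));

// A downstream task that consumed the file as input.
producerConsumer.addConsumer(consumerTask);

// A downstream task that consumed the file but produced no results; stored as -taskId.
producerConsumer.addNonProducingConsumer(failedTask);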


@ -10,7 +10,7 @@ import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections4.CollectionUtils;
import gov.nasa.ziggy.crud.AbstractCrud;
import gov.nasa.ziggy.crud.ZiggyQuery;
@ -61,7 +61,7 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
List<DatastoreProducerConsumer> datastoreProducerConsumers = retrieveOrCreate(pipelineTask,
datastoreNames(datastoreFiles));
for (DatastoreProducerConsumer datastoreProducerConsumer : datastoreProducerConsumers) {
datastoreProducerConsumer.setProducer(pipelineTask.getId());
datastoreProducerConsumer.setProducer(pipelineTask);
merge(datastoreProducerConsumer);
}
}
@ -72,8 +72,8 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
}
/** Retrieves the set of names of datastore files consumed by a specified pipeline task. */
public Set<String> retrieveFilesConsumedByTask(long taskId) {
return retrieveFilesConsumedByTasks(Set.of(taskId), null);
public Set<String> retrieveFilesConsumedByTask(PipelineTask pipelineTask) {
return retrieveFilesConsumedByTasks(Set.of(pipelineTask), null);
}
/**
@ -82,16 +82,27 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
* filenames collection will be included in the return; otherwise, all filenames that have a
* consumer from the collection of consumer IDs will be included.
*/
public Set<String> retrieveFilesConsumedByTasks(Collection<Long> consumerIds,
public Set<String> retrieveFilesConsumedByTasks(Collection<PipelineTask> consumers,
Collection<String> filenames) {
if (CollectionUtils.isEmpty(filenames)) {
return new HashSet<>(list(filesConsumedByTasksQuery(consumers)));
}
return new HashSet<>(chunkedQuery(new ArrayList<>(filenames),
chunk -> list(
filesConsumedByTasksQuery(consumers).column(DatastoreProducerConsumer_.filename)
.in(chunk))));
}
private ZiggyQuery<DatastoreProducerConsumer, String> filesConsumedByTasksQuery(
Collection<PipelineTask> consumers) {
ZiggyQuery<DatastoreProducerConsumer, String> query = createZiggyQuery(
DatastoreProducerConsumer.class, String.class);
query.select(DatastoreProducerConsumer_.filename).distinct(true);
if (!CollectionUtils.isEmpty(filenames)) {
query.column(DatastoreProducerConsumer_.filename).chunkedIn(filenames);
}
addConsumerIdPredicates(query, consumerIds);
return new HashSet<>(list(query));
addConsumerIdPredicates(query,
consumers.stream().map(PipelineTask::getId).collect(Collectors.toList()));
return query;
}
/**
@ -139,18 +150,21 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
Set<String> producedFileDatastoreNames = datastoreNames(producedFiles);
// Find the unique set of pipeline tasks that produced the output files.
ZiggyQuery<DatastoreProducerConsumer, Long> producerQuery = createZiggyQuery(
DatastoreProducerConsumer.class, Long.class);
producerQuery.column(DatastoreProducerConsumer_.filename)
.chunkedIn(producedFileDatastoreNames);
producerQuery.column(DatastoreProducerConsumer_.producerId).select().distinct(true);
List<Long> producerIds = chunkedQuery(new ArrayList<>(producedFileDatastoreNames),
chunk -> list(createZiggyQuery(DatastoreProducerConsumer.class, Long.class)
.column(DatastoreProducerConsumer_.filename)
.in(chunk)
.column(DatastoreProducerConsumer_.producerId)
.select()
.distinct(true)));
// Find and return the files that were consumed by the tasks that produced the
// outputs.
ZiggyQuery<DatastoreProducerConsumer, String> query = createZiggyQuery(
DatastoreProducerConsumer.class, String.class);
query.column(DatastoreProducerConsumer_.filename).select();
addConsumerIdPredicates(query, list(producerQuery));
DatastoreProducerConsumer.class, String.class)
.column(DatastoreProducerConsumer_.filename)
.select();
addConsumerIdPredicates(query, producerIds);
return list(query);
}
@ -160,13 +174,12 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
return;
}
List<DatastoreProducerConsumer> dpcs = retrieveOrCreate(null, datastoreNames);
dpcs.stream().forEach(s -> addConsumer(s, pipelineTask.getId()));
dpcs.stream().forEach(s -> addConsumer(s, pipelineTask));
}
/**
* Adds a non-producing consumer to each of a set of datastore files. A non-producing consumer
* is a consumer that failed to produce results from processing. It is indicated by the negative
* of the task ID.
* is a consumer that failed to produce results from processing.
*/
public void addNonProducingConsumer(PipelineTask pipelineTask, Set<String> datastoreNames) {
if (datastoreNames == null || datastoreNames.isEmpty()) {
@ -174,16 +187,25 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
}
List<DatastoreProducerConsumer> datastoreProducerConsumers = retrieveOrCreate(null,
datastoreNames);
datastoreProducerConsumers.stream().forEach(dpc -> addConsumer(dpc, -pipelineTask.getId()));
datastoreProducerConsumers.stream()
.forEach(dpc -> addNonProducingConsumer(dpc, pipelineTask));
}
/**
* Adds a consumer ID and updates or creates a {@link DatastoreProducerConsumer} instance.
* Implemented as a private method so that a stream forEach operation can apply it.
* Adds a consumer to the given {@link DatastoreProducerConsumer} instance.
*/
private void addConsumer(DatastoreProducerConsumer datastoreProducerConsumer,
long pipelineTaskId) {
datastoreProducerConsumer.addConsumer(pipelineTaskId);
PipelineTask pipelineTask) {
datastoreProducerConsumer.addConsumer(pipelineTask);
merge(datastoreProducerConsumer);
}
/**
* Adds a consumer to the given {@link DatastoreProducerConsumer} instance.
*/
private void addNonProducingConsumer(DatastoreProducerConsumer datastoreProducerConsumer,
PipelineTask pipelineTask) {
datastoreProducerConsumer.addNonProducingConsumer(pipelineTask);
merge(datastoreProducerConsumer);
}
@ -199,29 +221,29 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
* @param pipelineTask Producer task for constructed entries, if null a value of zero will be
* used for the producer ID of constructed entries
* @param filenames Names of datastore files to be located in the database table of
* {@link DatastoreProducerConsumer} instances. Must be mutable.
* {@link DatastoreProducerConsumer} instances.
* @return A {@link List} of {@link DatastoreProducerConsumer} instances, with the database
* versions for files that have database entries and new instances for those that do not.
* versions for files that have database entries and new instances for those that do not
*/
protected List<DatastoreProducerConsumer> retrieveOrCreate(PipelineTask pipelineTask,
Set<String> filenames) {
Set<String> allFilenames = new HashSet<>(filenames);
// Start by finding all the files that already have entries.
ZiggyQuery<DatastoreProducerConsumer, DatastoreProducerConsumer> query = createZiggyQuery(
DatastoreProducerConsumer.class);
query.column(DatastoreProducerConsumer_.filename).chunkedIn(allFilenames);
List<DatastoreProducerConsumer> datastoreProducerConsumers = list(query);
Set<String> allFilenames = new HashSet<>(filenames);
List<DatastoreProducerConsumer> datastoreProducerConsumers = chunkedQuery(
new ArrayList<>(allFilenames),
chunk -> list(createZiggyQuery(DatastoreProducerConsumer.class)
.column(DatastoreProducerConsumer_.filename)
.in(chunk)));
List<String> locatedFilenames = datastoreProducerConsumers.stream()
.map(DatastoreProducerConsumer::getFilename)
.collect(Collectors.toList());
// For all the filenames that lack entries, construct DatastoreProducerConsumer instances
long producerId = pipelineTask != null ? pipelineTask.getId() : 0;
// For all the filenames that lack entries, construct DatastoreProducerConsumer instances.
allFilenames.removeAll(locatedFilenames);
for (String filename : allFilenames) {
DatastoreProducerConsumer instance = new DatastoreProducerConsumer(producerId,
DatastoreProducerConsumer instance = new DatastoreProducerConsumer(pipelineTask,
filename);
persist(instance);
datastoreProducerConsumers.add(instance);
@ -276,15 +298,12 @@ public class DatastoreProducerConsumerCrud extends AbstractCrud<DatastoreProduce
Map<String, Path> datastoreFileByName = datastoreFiles.stream()
.collect(Collectors.toMap(Path::toString, Function.identity()));
Set<String> datastoreNames = new HashSet<>(datastoreFileByName.keySet());
if (datastoreNames.iterator().hasNext()) {
}
// There doesn't seem to be a better way to do this.
datastoreNames
.removeAll(list(createZiggyQuery(DatastoreProducerConsumer.class, String.class)
datastoreNames.removeAll(chunkedQuery(new ArrayList<>(datastoreNames),
chunk -> list(createZiggyQuery(DatastoreProducerConsumer.class, String.class)
.column(DatastoreProducerConsumer_.filename)
.chunkedIn(datastoreNames)
.select()));
.in(chunk)
.select())));
return datastoreNames.stream().map(datastoreFileByName::get).collect(Collectors.toSet());
}
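Several of the hunks above replace ZiggyQuery.chunkedIn() with AbstractCrud.chunkedQuery() as part of the ZIGGY-462 fix: a long IN list is split into chunks, the supplied query runs on each chunk, and the results are concatenated. A rough, self-contained sketch of that pattern, not Ziggy's actual implementation; the chunk size is an assumption:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ChunkedQuerySketch {

    // Breaks a long value list into chunks, runs the caller-supplied query on each chunk,
    // and concatenates the results.
    static <T, R> List<R> chunkedQuery(List<T> values, Function<List<T>, List<R>> queryForChunk) {
        int chunkSize = 1000; // assumed; keeps IN clauses below typical database limits
        List<R> results = new ArrayList<>();
        for (int start = 0; start < values.size(); start += chunkSize) {
            List<T> chunk = values.subList(start, Math.min(start + chunkSize, values.size()));
            results.addAll(queryForChunk.apply(chunk));
        }
        return results;
    }
}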


@ -21,10 +21,10 @@ public class DatastoreProducerConsumerOperations extends DatabaseOperations {
private DatastoreProducerConsumerCrud datastoreProducerConsumerCrud = new DatastoreProducerConsumerCrud();
private PipelineTaskCrud pipelineTaskCrud = new PipelineTaskCrud();
public Set<String> filesConsumedByTasks(Collection<Long> consumerIds,
public Set<String> filesConsumedByTasks(Collection<PipelineTask> consumers,
Collection<String> filenames) {
return performTransaction(() -> datastoreProducerConsumerCrud()
.retrieveFilesConsumedByTasks(consumerIds, filenames));
.retrieveFilesConsumedByTasks(consumers, filenames));
}
public void createOrUpdateProducer(PipelineTask pipelineTask, Collection<Path> files) {


@ -104,19 +104,20 @@ public abstract class Metric implements Serializable {
*/
public static void log() {
long now = System.currentTimeMillis();
metricsLogger.info("SNAPSHOT-START@" + now);
metricsLogger.info("SNAPSHOT-START@{}", now);
Iterator<String> it = Metric.metricsIterator();
logWithIterator(it);
metricsLogger.info("SNAPSHOT-END@" + now);
metricsLogger.info("SNAPSHOT-END@{}", now);
}
/**
* Dump all metrics to stdout
*/
public static void dump() {
Metric.dump(new PrintWriter(new OutputStreamWriter(System.out, ZiggyFileUtils.ZIGGY_CHARSET)));
Metric.dump(
new PrintWriter(new OutputStreamWriter(System.out, ZiggyFileUtils.ZIGGY_CHARSET)));
}
/**
@ -197,13 +198,13 @@ public abstract class Metric implements Serializable {
for (String metricName : metricsToMerge.keySet()) {
Metric metricToMerge = metricsToMerge.get(metricName);
log.debug("merge: metricToMerge=" + metricToMerge);
log.debug("metricToMerge={}", metricToMerge);
Metric existingGlobalMetric = globalMetrics.get(metricName);
if (existingGlobalMetric != null) {
log.debug("merge: existingGlobalMetric(BEFORE)=" + existingGlobalMetric);
log.debug("existingGlobalMetric(BEFORE)={}", existingGlobalMetric);
existingGlobalMetric.merge(metricToMerge);
log.debug("merge: existingGlobalMetric(AFTER)=" + existingGlobalMetric);
log.debug("existingGlobalMetric(AFTER)={}", existingGlobalMetric);
} else {
log.debug("No existingGlobalMetric exists, adding");
globalMetrics.put(metricName, metricToMerge.makeCopy());
@ -212,9 +213,9 @@ public abstract class Metric implements Serializable {
if (Metric.threadMetricsEnabled()) {
Metric existingThreadMetric = Metric.getThreadMetric(metricName);
if (existingThreadMetric != null) {
log.debug("merge: existingThreadMetric(BEFORE)=" + existingThreadMetric);
log.debug("existingThreadMetric(BEFORE)={}", existingThreadMetric);
existingThreadMetric.merge(metricToMerge);
log.debug("merge: existingThreadMetric(AFTER)=" + existingThreadMetric);
log.debug("existingThreadMetric(AFTER)={}", existingThreadMetric);
} else {
log.debug("No existingThreadMetric exists, adding");
Metric.addNewThreadMetric(metricToMerge.makeCopy());
@ -361,12 +362,12 @@ public abstract class Metric implements Serializable {
*/
protected static void log(String prefix) {
long now = System.currentTimeMillis();
metricsLogger.info("SNAPSHOT-START@" + now);
metricsLogger.info("SNAPSHOT-START@{}", now);
Iterator<String> it = Metric.metricsIterator(prefix);
logWithIterator(it);
metricsLogger.info("SNAPSHOT-END@" + now);
metricsLogger.info("SNAPSHOT-END@{}", now);
}
/**
@ -376,14 +377,14 @@ public abstract class Metric implements Serializable {
*/
protected static void log(List<String> prefixes) {
long now = System.currentTimeMillis();
metricsLogger.info("SNAPSHOT-START@" + now);
metricsLogger.info("SNAPSHOT-START@{}", now);
for (String prefix : prefixes) {
Iterator<String> it = Metric.metricsIterator(prefix);
logWithIterator(it);
}
metricsLogger.info("SNAPSHOT-END@" + now);
metricsLogger.info("SNAPSHOT-END@{}", now);
}
/**
@ -395,7 +396,7 @@ public abstract class Metric implements Serializable {
while (it.hasNext()) {
String name = it.next();
Metric metric = Metric.getGlobalMetric(name);
metricsLogger.debug(metric.getName() + ":" + metric.getLogString());
metricsLogger.debug("{}:{}", metric.getName(), metric.getLogString());
}
}
@ -432,7 +433,7 @@ public abstract class Metric implements Serializable {
@Override
public void remove() {
throw new UnsupportedOperationException("this is a read-only iterator");
throw new UnsupportedOperationException("This is a read-only iterator");
}
@Override


@ -20,8 +20,7 @@ public class MetricsCrud extends AbstractCrud<MetricType> {
return list(createZiggyQuery(MetricType.class));
}
public List<MetricValue> metricValues(MetricType metricType, Date start,
Date end) {
public List<MetricValue> metricValues(MetricType metricType, Date start, Date end) {
ZiggyQuery<MetricValue, MetricValue> query = createZiggyQuery(MetricValue.class);
query.column(MetricValue_.timestamp).between(start, end).ascendingOrder();
query.column(MetricValue_.metricType).in(metricType);
@ -42,7 +41,7 @@ public class MetricsCrud extends AbstractCrud<MetricType> {
}
public long deleteOldMetrics(int maxRows) {
log.info("Preparing to delete old rows from PI_METRIC_VALUE. maxRows = " + maxRows);
log.info("Preparing to delete old rows from PI_METRIC_VALUE, maxRows={}", maxRows);
long rowCount = 0;
long numRowsOverLimit = 0;
@ -52,10 +51,10 @@ public class MetricsCrud extends AbstractCrud<MetricType> {
rowCount = retrieveMetricValueRowCount();
numRowsOverLimit = rowCount - maxRows;
log.info("rowCount = " + rowCount);
log.info("rowCount={}", rowCount);
if (numRowsOverLimit > 0) {
log.info("numRowsOverLimit = " + numRowsOverLimit);
log.info("numRowsOverLimit={}", numRowsOverLimit);
long minId = retrieveMinimumId();
long idToDelete = minId + numRowsOverLimit - 1;
@ -65,8 +64,7 @@ public class MetricsCrud extends AbstractCrud<MetricType> {
query.where(builder.lessThanOrEqualTo(root.get("id"), idToDelete));
int numUpdatedThisChunk = executeUpdate(query);
log.info(
"deleted " + numUpdatedThisChunk + " rows (where id <= " + idToDelete + ")");
log.info("Deleted {} rows (where id <= {})", numUpdatedThisChunk, idToDelete);
numUpdated += numUpdatedThisChunk;
} else {
@ -74,7 +72,7 @@ public class MetricsCrud extends AbstractCrud<MetricType> {
}
} while (numRowsOverLimit > 0);
log.info("deleted a total of " + numUpdated + " rows.");
log.info("Deleted a total of {} rows.", numUpdated);
return numUpdated;
}
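The deletion loop itself is unchanged by the logging cleanup; a worked example of its arithmetic, with invented numbers and assuming contiguous ids:

// Suppose PI_METRIC_VALUE holds 1,250,000 rows with ids 17..1,250,016 and maxRows = 1,000,000.
// numRowsOverLimit = 1,250,000 - 1,000,000 = 250,000
// idToDelete       = minId + numRowsOverLimit - 1 = 17 + 250,000 - 1 = 250,016
// The first pass deletes the 250,000 rows with id <= 250,016; the loop then re-counts,
// finds numRowsOverLimit <= 0, and exits.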


@ -86,7 +86,8 @@ public class MetricsDumper implements Runnable {
FileOutputStream fout = new FileOutputStream(metricsFile, true /* append mode */);
BufferedOutputStream bout = new BufferedOutputStream(fout, BUF_SIZE_BYTES);
countOut = new CountingOutputStream(bout);
printWriter = new PrintWriter(new OutputStreamWriter(countOut, ZiggyFileUtils.ZIGGY_CHARSET));
printWriter = new PrintWriter(
new OutputStreamWriter(countOut, ZiggyFileUtils.ZIGGY_CHARSET));
} catch (IOException e) {
throw new UncheckedIOException("IOException occurred on file " + metricsFile.toString(),
e);


@ -23,8 +23,7 @@ public class MetricsOperations extends DatabaseOperations {
return performTransaction(() -> metricsCrud().retrieveAllMetricTypes());
}
public List<MetricValue> metricValues(MetricType metricType, Date start,
Date end) {
public List<MetricValue> metricValues(MetricType metricType, Date start, Date end) {
return performTransaction(() -> metricsCrud().metricValues(metricType, start, end));
}


@ -5,10 +5,9 @@ import java.util.List;
import java.util.Map;
import java.util.Objects;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskDisplayData;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
import gov.nasa.ziggy.util.dispmod.DisplayModel;
/**
* Compute the time spent on the specified category for a list of tasks and the percentage of the
@ -18,13 +17,13 @@ import gov.nasa.ziggy.util.dispmod.DisplayModel;
*/
public class TaskMetrics {
private final Map<String, TimeAndPercentile> categoryMetrics = new HashMap<>();
private final List<PipelineTask> pipelineTasks;
private final List<PipelineTaskDisplayData> pipelineTasks;
private TimeAndPercentile unallocatedTime = null;
private long totalProcessingTimeMillis;
private PipelineTaskOperations pipelineTaskOperations = new PipelineTaskOperations();
public TaskMetrics(List<PipelineTask> tasks) {
pipelineTasks = tasks;
public TaskMetrics(List<PipelineTaskDisplayData> taskListForModule) {
pipelineTasks = taskListForModule;
}
public void calculate() {
@ -32,13 +31,11 @@ public class TaskMetrics {
Map<String, Long> allocatedTimeByCategory = new HashMap<>();
if (pipelineTasks != null) {
for (PipelineTask task : pipelineTasks) {
totalProcessingTimeMillis += DisplayModel.getProcessingMillis(
task.getStartProcessingTime(), task.getEndProcessingTime());
for (PipelineTaskDisplayData task : pipelineTasks) {
totalProcessingTimeMillis += task.getExecutionClock().totalExecutionTime();
List<PipelineTaskMetrics> summaryMetrics = pipelineTaskOperations()
.summaryMetrics(task);
for (PipelineTaskMetrics metrics : summaryMetrics) {
List<PipelineTaskMetric> pipelineTaskMetrics = task.getPipelineTaskMetrics();
for (PipelineTaskMetric metrics : pipelineTaskMetrics) {
String category = metrics.getCategory();
Long categoryTimeMillis = allocatedTimeByCategory.get(category);
if (categoryTimeMillis == null) {
@ -81,6 +78,10 @@ public class TaskMetrics {
return totalProcessingTimeMillis;
}
PipelineTaskOperations pipelineTaskOperations() {
return pipelineTaskOperations;
}
@Override
public int hashCode() {
return Objects.hash(categoryMetrics, totalProcessingTimeMillis, unallocatedTime);
@ -99,8 +100,4 @@ public class TaskMetrics {
&& totalProcessingTimeMillis == other.totalProcessingTimeMillis
&& Objects.equals(unallocatedTime, other.unallocatedTime);
}
PipelineTaskOperations pipelineTaskOperations() {
return pipelineTaskOperations;
}
}
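TaskMetrics now consumes PipelineTaskDisplayData directly, summing wall time from each task's execution clock and attributing per-category time from its PipelineTaskMetric list. A usage sketch with invented numbers (assuming percentages on a 0-100 scale):

// taskDisplayData would come from PipelineTaskDisplayDataOperations.pipelineTaskDisplayData(tasks).
TaskMetrics taskMetrics = new TaskMetrics(taskDisplayData);
taskMetrics.calculate();

// If the tasks total 100 minutes of execution time and their PipelineTaskMetric entries attribute
// 70 of those minutes to "Algorithm", the category map ends up holding roughly
// TimeAndPercentile(70 * 60 * 1000L, 70.0), with the remaining 30% reported as unallocated time.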


@ -4,6 +4,22 @@ import java.util.Objects;
public class TimeAndPercentile {
private final long timeMillis;
private final double percent;
public TimeAndPercentile(long timeMillis, double percent) {
this.timeMillis = timeMillis;
this.percent = percent;
}
public long getTimeMillis() {
return timeMillis;
}
public double getPercent() {
return percent;
}
@Override
public int hashCode() {
return Objects.hash(percent, timeMillis);
@ -21,20 +37,4 @@ public class TimeAndPercentile {
return Double.doubleToLongBits(percent) == Double.doubleToLongBits(other.percent)
&& timeMillis == other.timeMillis;
}
private final long timeMillis;
private final double percent;
public TimeAndPercentile(long timeMillis, double percent) {
this.timeMillis = timeMillis;
this.percent = percent;
}
public long getTimeMillis() {
return timeMillis;
}
public double getPercent() {
return percent;
}
}


@ -6,12 +6,15 @@ import gov.nasa.ziggy.pipeline.PipelineReportGenerator;
import gov.nasa.ziggy.pipeline.definition.PipelineInstance;
import gov.nasa.ziggy.pipeline.definition.PipelineInstanceNode;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskDisplayData;
import gov.nasa.ziggy.pipeline.definition.database.PipelineInstanceNodeOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDisplayDataOperations;
import gov.nasa.ziggy.util.dispmod.TasksDisplayModel;
public class AppendixReport extends Report {
private PipelineInstanceNodeOperations pipelineInstanceNodeOperations = new PipelineInstanceNodeOperations();
private final PipelineInstanceNodeOperations pipelineInstanceNodeOperations = new PipelineInstanceNodeOperations();
private final PipelineTaskDisplayDataOperations pipelineTaskDisplayDataOperations = new PipelineTaskDisplayDataOperations();
public AppendixReport(PdfRenderer pdfRenderer) {
super(pdfRenderer);
@ -20,8 +23,10 @@ public class AppendixReport extends Report {
public void generateReport(PipelineInstance instance, List<PipelineInstanceNode> nodes) {
List<PipelineTask> tasks = pipelineInstanceNodeOperations().pipelineTasks(nodes);
List<PipelineTaskDisplayData> taskData = pipelineTaskDisplayDataOperations()
.pipelineTaskDisplayData(tasks);
TasksDisplayModel tasksDisplayModel = new TasksDisplayModel(tasks);
TasksDisplayModel tasksDisplayModel = new TasksDisplayModel(taskData);
float[] colsWidth = { 0.5f, 0.5f, 2f, 1f, 1f, 0.5f, 1.5f };
printDisplayModel("Appendix A: All Tasks", tasksDisplayModel, colsWidth);
@ -37,4 +42,8 @@ public class AppendixReport extends Report {
PipelineInstanceNodeOperations pipelineInstanceNodeOperations() {
return pipelineInstanceNodeOperations;
}
PipelineTaskDisplayDataOperations pipelineTaskDisplayDataOperations() {
return pipelineTaskDisplayDataOperations;
}
}


@ -46,7 +46,7 @@ public class InstanceMetricsReport {
/**
* Top-level directory that contains the task files. Assumes that this directory contains all of
* the task directories, which in turn contain all of the sub-task directories
* the task directories, which in turn contain all of the subtask directories
*/
private File rootDirectory = null;
@ -66,9 +66,9 @@ public class InstanceMetricsReport {
private final TopNList instanceTopNList = new TopNList(TOP_N_INSTANCE);
// Map<taskDirName,List<execTime>> - Complete list of exec times, by sky group
private final Map<String, List<Double>> subTaskExecTimesByTask = new HashMap<>();
private final Map<String, List<Double>> subtaskExecTimesByTask = new HashMap<>();
private final List<Double> subTaskExecTimes = new ArrayList<>(200000);
private final List<Double> subtaskExecTimes = new ArrayList<>(200000);
private PdfRenderer instancePdfRenderer;
private PdfRenderer taskPdfRenderer;
@ -76,7 +76,7 @@ public class InstanceMetricsReport {
public InstanceMetricsReport(File rootDirectory) {
if (rootDirectory == null || !rootDirectory.isDirectory()) {
throw new IllegalArgumentException(
"rootDirectory does not exist or is not a directory: " + rootDirectory);
"rootDirectory " + rootDirectory + " does not exist or is not a directory ");
}
this.rootDirectory = rootDirectory;
}
@ -94,13 +94,13 @@ public class InstanceMetricsReport {
parseFiles();
log.info("Instance Metrics");
log.info("Instance metrics");
dump(instanceMetrics);
dumpTopTen(instancePdfRenderer, "Top " + TOP_N_INSTANCE + " for instance: ",
dumpTopTen(instancePdfRenderer, "Top " + TOP_N_INSTANCE + " for instance ",
instanceTopNList);
JFreeChart histogram = generateHistogram("instance", subTaskExecTimes);
JFreeChart histogram = generateHistogram("instance", subtaskExecTimes);
if (histogram != null) {
// chart2Png(histogram,
@ -124,7 +124,7 @@ public class InstanceMetricsReport {
.listFiles((FileFilter) f -> f.getName().contains("-matlab-") && f.isDirectory());
for (File taskDir : taskDirs) {
log.info("Processing: " + taskDir);
log.info("Processing {}", taskDir);
String taskDirName = taskDir.getName();
Map<String, Metric> taskMetrics = taskMetricsMap.get(taskDirName);
@ -140,19 +140,19 @@ public class InstanceMetricsReport {
&& pathname.isDirectory());
if (subtaskDirs == null) {
log.info("No sub-task directories found");
log.info("No subtask directories found");
return;
}
log.info("Found " + subtaskDirs.length + " sub-task directories");
log.info("Found {} subtask directories", subtaskDirs.length);
for (File subTaskDir : subtaskDirs) {
File subTaskMetricsFile = new File(subTaskDir, METRICS_FILE_NAME);
for (File subtaskDir : subtaskDirs) {
File subtaskMetricsFile = new File(subtaskDir, METRICS_FILE_NAME);
if (subTaskMetricsFile.exists()) {
Map<String, Metric> subTaskMetrics = Metric
.loadMetricsFromSerializedFile(subTaskMetricsFile);
if (subtaskMetricsFile.exists()) {
Map<String, Metric> subtaskMetrics = Metric
.loadMetricsFromSerializedFile(subtaskMetricsFile);
for (Metric metric : subTaskMetrics.values()) {
for (Metric metric : subtaskMetrics.values()) {
// merge this metric into the instance metrics
merge(metric, instanceMetrics);
// merge this metric into the task metrics
@ -162,21 +162,21 @@ public class InstanceMetricsReport {
IntervalMetric totalTimeMetric = (IntervalMetric) metric;
int execTime = (int) totalTimeMetric.getAverage();
instanceTopNList.add(execTime,
taskDirName + "/" + subTaskDir.getName());
taskTopNList.add(execTime, taskDirName + "/" + subTaskDir.getName());
taskDirName + "/" + subtaskDir.getName());
taskTopNList.add(execTime, taskDirName + "/" + subtaskDir.getName());
addExecTime(taskDirName, totalTimeMetric.getAverage());
}
}
} else {
log.warn("No metrics file found in: " + subTaskDir);
log.warn("No metrics file found in {}", subtaskDir);
}
}
log.info("Metrics for: " + taskDirName);
dumpTopTen(taskPdfRenderer, "Top " + TOP_N_TASKS + " for task: " + taskDirName,
log.info("Metrics for {}", taskDirName);
dumpTopTen(taskPdfRenderer, "Top " + TOP_N_TASKS + " for task " + taskDirName,
taskTopNList);
List<Double> taskExecTimes = subTaskExecTimesByTask.get(taskDirName);
List<Double> taskExecTimes = subtaskExecTimesByTask.get(taskDirName);
JFreeChart histogram = generateHistogram(taskDirName, taskExecTimes);
if (histogram != null) {
@ -191,15 +191,15 @@ public class InstanceMetricsReport {
}
private void addExecTime(String taskDirName, double execTime) {
List<Double> timesForTask = subTaskExecTimesByTask.get(taskDirName);
List<Double> timesForTask = subtaskExecTimesByTask.get(taskDirName);
if (timesForTask == null) {
timesForTask = new ArrayList<>(5000);
subTaskExecTimesByTask.put(taskDirName, timesForTask);
subtaskExecTimesByTask.put(taskDirName, timesForTask);
}
double timeHours = execTime / 1000.0 / 3600.0; // convert to hours
timesForTask.add(timeHours);
subTaskExecTimes.add(timeHours);
subtaskExecTimes.add(timeHours);
}
private double[] listToArray(List<Double> list) {
@ -229,7 +229,7 @@ public class InstanceMetricsReport {
dataset.addSeries("execTime", values, NUM_BINS);
JFreeChart chart = ChartFactory.createHistogram("Algorithm Run-time (" + label + ")",
"execTime (hours)", "Number of Sub-tasks", dataset, PlotOrientation.VERTICAL, true,
"execTime (hours)", "Number of Subtasks", dataset, PlotOrientation.VERTICAL, true,
true, false);
XYPlot plot = (XYPlot) chart.getPlot();
plot.setDomainPannable(true);
@ -249,10 +249,10 @@ public class InstanceMetricsReport {
private JFreeChart generateBoxAndWhiskers() {
DefaultBoxAndWhiskerCategoryDataset dataset = new DefaultBoxAndWhiskerCategoryDataset();
Set<String> taskNames = subTaskExecTimesByTask.keySet();
Set<String> taskNames = subtaskExecTimesByTask.keySet();
for (String taskName : taskNames) {
log.info("taskDirName = " + taskName);
List<Double> execTimesForTask = subTaskExecTimesByTask.get(taskName);
log.info("taskDirName={}", taskName);
List<Double> execTimesForTask = subtaskExecTimesByTask.get(taskName);
dataset.add(BoxAndWhiskerCalculator.calculateBoxAndWhiskerStatistics(execTimesForTask),
taskName, taskName);
}
@ -279,7 +279,7 @@ public class InstanceMetricsReport {
private void dump(Map<String, Metric> metrics) {
for (Metric metric : metrics.values()) {
log.info(metric.getName() + ": " + metric.toString());
log.info("{}: {}", metric.getName(), metric.toString());
}
}


@ -13,8 +13,9 @@ import gov.nasa.ziggy.module.PipelineCategories;
import gov.nasa.ziggy.pipeline.definition.PipelineInstance;
import gov.nasa.ziggy.pipeline.definition.PipelineInstanceNode;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics;
import gov.nasa.ziggy.pipeline.definition.database.PipelineInstanceNodeOperations;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDataOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDisplayDataOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
import gov.nasa.ziggy.util.dispmod.InstancesDisplayModel;
import gov.nasa.ziggy.util.dispmod.TaskSummaryDisplayModel;
@ -22,8 +23,9 @@ import gov.nasa.ziggy.util.dispmod.TaskSummaryDisplayModel;
public class InstanceReport extends Report {
private static final Logger log = LoggerFactory.getLogger(InstanceReport.class);
private PipelineInstanceNodeOperations pipelineInstanceNodeOperations = new PipelineInstanceNodeOperations();
private PipelineTaskOperations pipelineTaskOperations = new PipelineTaskOperations();
private PipelineTaskDataOperations pipelineTaskDataOperations = new PipelineTaskDataOperations();
private PipelineTaskDisplayDataOperations pipelineTaskDisplayDataOperations = new PipelineTaskDisplayDataOperations();
public InstanceReport(PdfRenderer pdfRenderer) {
super(pdfRenderer);
@ -48,28 +50,23 @@ public class InstanceReport extends Report {
timeTable.setWidthPercentage(100);
addCell(timeTable, "Start", true);
addCell(timeTable, "End", true);
addCell(timeTable, "Total", true);
addCell(timeTable, dateToDateString(instance.getStartProcessingTime()), false);
addCell(timeTable, dateToDateString(instance.getEndProcessingTime()), false);
String elapsedTime = instance.elapsedTime();
addCell(timeTable, elapsedTime, false);
addCell(timeTable, dateToDateString(instance.getCreated()), false);
addCell(timeTable, instance.getExecutionClock().toString(), false);
pdfRenderer.add(timeTable);
pdfRenderer.println();
// Instance Summary
InstancesDisplayModel instancesDisplayModel = new InstancesDisplayModel(instance);
printDisplayModel("", instancesDisplayModel);
printDisplayModel("", new InstancesDisplayModel(instance));
pdfRenderer.println();
// Task Summary
TaskSummaryDisplayModel tasksDisplayModel = new TaskSummaryDisplayModel(
pipelineInstanceNodeOperations().taskCounts(nodes));
printDisplayModel("Pipeline Task Summary", tasksDisplayModel);
printDisplayModel("Pipeline Task Summary",
new TaskSummaryDisplayModel(pipelineTaskDisplayDataOperations().taskCounts(nodes)));
pdfRenderer.println();
@ -123,9 +120,10 @@ public class InstanceReport extends Report {
List<PipelineTask> tasks = pipelineTaskOperations().allPipelineTasks();
for (PipelineTask task : tasks) {
List<PipelineTaskMetrics> taskMetrics = pipelineTaskOperations().summaryMetrics(task);
List<PipelineTaskMetric> taskMetrics = pipelineTaskDataOperations()
.pipelineTaskMetrics(task);
for (PipelineTaskMetrics taskMetric : taskMetrics) {
for (PipelineTaskMetric taskMetric : taskMetrics) {
if (taskMetric.getCategory().equals(sizeCategory)) {
sizeStats.addValue(taskMetric.getValue());
} else if (taskMetric.getCategory().equals(timeCategory)) {
@ -139,9 +137,9 @@ public class InstanceReport extends Report {
double bytesPerSecondForNode = bytesForNode / (millisForNode / 1000);
log.info("bytesForNode = " + bytesForNode);
log.info("millisForNode = " + millisForNode);
log.info("bytesPerSecondForNode = " + bytesPerSecondForNode);
log.info("bytesForNode = {}", bytesForNode);
log.info("millisForNode = {}", millisForNode);
log.info("bytesPerSecondForNode = {}", bytesPerSecondForNode);
addCell(transfersTable, node.getModuleName());
addCell(transfersTable, label);
@ -149,11 +147,15 @@ public class InstanceReport extends Report {
addCell(transfersTable, rateFormatter.format(bytesPerSecondForNode));
}
private PipelineInstanceNodeOperations pipelineInstanceNodeOperations() {
return pipelineInstanceNodeOperations;
}
private PipelineTaskOperations pipelineTaskOperations() {
return pipelineTaskOperations;
}
private PipelineTaskDataOperations pipelineTaskDataOperations() {
return pipelineTaskDataOperations;
}
private PipelineTaskDisplayDataOperations pipelineTaskDisplayDataOperations() {
return pipelineTaskDisplayDataOperations;
}
}


@ -68,7 +68,7 @@ public class MatlabMetrics {
topTen = new TopNList(10);
if (cacheFile.exists()) {
log.info("Found cache file");
log.info("Found cache file {}", cacheFile);
try (ObjectInputStream ois = new ObjectInputStream(
new BufferedInputStream(new FileInputStream(cacheFile)))) {
CacheContents cacheContents = (CacheContents) ois.readObject();
@ -84,34 +84,34 @@ public class MatlabMetrics {
&& f.isDirectory());
for (File taskDir : taskDirs) {
log.info("Processing: " + taskDir);
log.info("Processing {}", taskDir);
SubtaskDirectoryIterator directoryIterator = new SubtaskDirectoryIterator(
taskDir);
if (directoryIterator.hasNext()) {
log.info("Found " + directoryIterator.numSubTasks()
+ " sub-task directories");
log.info("Found {} subtask directories",
directoryIterator.numSubtasks());
} else {
log.info("No sub-task directories found");
log.info("No subtask directories found");
}
while (directoryIterator.hasNext()) {
File subTaskDir = directoryIterator.next().getSubtaskDir();
File subtaskDir = directoryIterator.next().getSubtaskDir();
log.debug("STM: " + subTaskDir);
log.debug("STM {}", subtaskDir);
File subTaskMetricsFile = new File(subTaskDir, MATLAB_METRICS_FILENAME);
File subtaskMetricsFile = new File(subtaskDir, MATLAB_METRICS_FILENAME);
if (subTaskMetricsFile.exists()) {
Map<String, Metric> subTaskMetrics = Metric
.loadMetricsFromSerializedFile(subTaskMetricsFile);
if (subtaskMetricsFile.exists()) {
Map<String, Metric> subtaskMetrics = Metric
.loadMetricsFromSerializedFile(subtaskMetricsFile);
for (String metricName : subTaskMetrics.keySet()) {
for (String metricName : subtaskMetrics.keySet()) {
if (!metricName.equals(MATLAB_CONTROLLER_EXEC_TIME_METRIC)) {
Metric metric = subTaskMetrics.get(metricName);
Metric metric = subtaskMetrics.get(metricName);
log.debug("STM: " + metricName + ": " + metric.toString());
log.debug("STM {}={}", metricName, metric.toString());
DescriptiveStatistics metricStats = functionStats
.get(metricName);
@ -125,22 +125,22 @@ public class MatlabMetrics {
}
}
Metric metric = subTaskMetrics
Metric metric = subtaskMetrics
.get(MATLAB_CONTROLLER_EXEC_TIME_METRIC);
if (metric != null) {
String subTaskName = subTaskDir.getParentFile().getName() + "/"
+ subTaskDir.getName();
String subtaskName = subtaskDir.getParentFile().getName() + "/"
+ subtaskDir.getName();
IntervalMetric totalTimeMetric = (IntervalMetric) metric;
double mean = totalTimeMetric.getAverage();
totalTimeStats.addValue(mean);
topTen.add((long) mean, subTaskName);
topTen.add((long) mean, subtaskName);
} else {
log.warn("no metric found with name: "
+ MATLAB_CONTROLLER_EXEC_TIME_METRIC + " in:" + subTaskDir);
log.warn("No metric found with name {} in {}",
MATLAB_CONTROLLER_EXEC_TIME_METRIC, subtaskDir);
}
} else {
log.warn("No metrics file found in: " + subTaskDir);
log.warn("No metrics file found in {}", subtaskDir);
}
}
}


@ -47,12 +47,12 @@ public class MatlabReport extends Report {
DefaultPieDataset functionBreakdownDataset = new DefaultPieDataset();
log.info("breakdown report");
log.info("Breakdown report");
for (String metricName : matlabFunctionStats.keySet()) {
String label = shortMetricName(metricName);
log.info("processing metric: " + label);
log.info("Processing metric {}", label);
DescriptiveStatistics functionStats = matlabFunctionStats.get(metricName);
double functionTime = functionStats.getSum();
@ -75,7 +75,7 @@ public class MatlabReport extends Report {
HumanReadableStatistics values = millisToHumanReadable(matlabStats);
JFreeChart execHistogram = generateHistogram("MATLAB Controller Run Time",
"Time (" + values.getUnit() + ")", "Sub-Tasks", values.getValues(), 100);
"Time (" + values.getUnit() + ")", "Subtasks", values.getValues(), 100);
if (execHistogram != null) {
pdfRenderer.printChart(execHistogram, CHART3_WIDTH, CHART3_HEIGHT);
@ -99,22 +99,22 @@ public class MatlabReport extends Report {
TopNList topTen = new TopNList(10);
for (File taskDir : taskDirs) {
log.info("Processing: " + taskDir);
log.info("Processing {}", taskDir);
Memdrone memdrone = new Memdrone(moduleName, instanceId);
Map<String, DescriptiveStatistics> taskStats = memdrone.statsByPid();
Map<String, String> pidMap = memdrone.subTasksByPid();
Map<String, String> pidMap = memdrone.subtasksByPid();
Set<String> pids = taskStats.keySet();
for (String pid : pids) {
String subTaskName = pidMap.get(pid);
if (subTaskName == null) {
subTaskName = "?:" + pid;
String subtaskName = pidMap.get(pid);
if (subtaskName == null) {
subtaskName = "?:" + pid;
}
double max = taskStats.get(pid).getMax();
memoryStats.addValue(max);
topTen.add((long) max, subTaskName);
topTen.add((long) max, subtaskName);
}
}


@ -165,12 +165,11 @@ public class Memdrone {
Path memdronePath = latestMemdronePath();
commandLine.addArgument(memdronePath.toString());
if (watchdogMap.containsKey(nameRoot)) {
log.info("Memdrone for module " + binaryName + ", instance " + instanceId
+ " already running");
log.info("Memdrone for module {}, instance {} already running", binaryName, instanceId);
return;
}
log.info("Starting memdrone for module " + binaryName + " in instance " + instanceId);
log.info("Starting memdrone for module {} in instance {}", binaryName, instanceId);
ExternalProcess memdroneProcess = ExternalProcess.simpleExternalProcess(commandLine);
memdroneProcess.execute(false);
watchdogMap.put(nameRoot, memdroneProcess.getWatchdog());
@ -181,13 +180,13 @@ public class Memdrone {
*/
public void stopMemdrone() {
if (watchdogMap.containsKey(nameRoot)) {
log.info("Stopping memdrone for module " + binaryName + " in instance " + instanceId);
log.info("Stopping memdrone for module {} in instance {}", binaryName, instanceId);
watchdogMap.get(nameRoot).destroyProcess();
log.info("Memdrone stopped");
watchdogMap.remove(nameRoot);
} else {
log.info("No memdrone script was running for module " + binaryName + " in instance "
+ instanceId);
log.info("No memdrone script was running for module {} in instance {}", binaryName,
instanceId);
}
}
@ -239,19 +238,19 @@ public class Memdrone {
justification = SpotBugsUtils.DESERIALIZATION_JUSTIFICATION)
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
@AcceptableCatchBlock(rationale = Rationale.CAN_NEVER_OCCUR)
public Map<String, String> subTasksByPid() {
public Map<String, String> subtasksByPid() {
Path cacheFile = latestMemdronePath().resolve(PID_MAP_CACHE_FILENAME);
Map<String, String> pidToSubTask = null;
Map<String, String> pidToSubtask = null;
if (Files.exists(cacheFile)) {
log.info("Using pid cache file");
log.info("Using PID cache file {}", cacheFile);
try (ObjectInputStream ois = new ObjectInputStream(
new BufferedInputStream(new FileInputStream(cacheFile.toFile())))) {
@SuppressWarnings("unchecked")
Map<String, String> obj = (Map<String, String>) ois.readObject();
pidToSubTask = obj;
pidToSubtask = obj;
log.debug("pid cache: " + obj);
log.debug("PID cache {}", obj);
} catch (IOException e) {
throw new UncheckedIOException(
"IOException occurred reading from " + cacheFile.toString(), e);
@ -261,11 +260,11 @@ public class Memdrone {
throw new AssertionError(e);
}
} else { // no cache
log.info("Creating pid cache file");
pidToSubTask = createPidMapCache();
log.info("Creating PID cache file");
pidToSubtask = createPidMapCache();
}
return pidToSubTask;
return pidToSubtask;
}
/**
@ -281,10 +280,10 @@ public class Memdrone {
.listFiles((FileFilter) f -> f.getName().startsWith("memdrone-") && f.isFile());
if (memdroneLogs != null) {
log.info("Number of memdrone-* files found: " + memdroneLogs.length);
log.info("Found {} memdrone-* files", memdroneLogs.length);
for (File memdroneLog : memdroneLogs) {
log.info("Processing: " + memdroneLog);
log.info("Processing {}", memdroneLog);
String filename = memdroneLog.getName();
String host = filename.substring(filename.indexOf("-") + 1, filename.indexOf("."));
@ -318,9 +317,9 @@ public class Memdrone {
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public Map<String, String> createPidMapCache() {
Path cacheFile = latestMemdronePath().resolve(PID_MAP_CACHE_FILENAME);
Map<String, String> pidToSubTask = new HashMap<>();
Map<String, String> pidToSubtask = new HashMap<>();
log.info("cacheFile: " + cacheFile);
log.info("cacheFile={}", cacheFile);
Pattern taskPattern = Pattern
.compile(binaryName + "-" + Long.toString(instanceId) + "-" + "\\d+");
@ -331,22 +330,22 @@ public class Memdrone {
.listFiles((FileFilter) f -> taskPattern.matcher(f.getName()).matches());
if (taskDirs != null) {
for (File taskDir : taskDirs) {
log.info("Processing task " + taskDir.getName());
log.info("Processing task {}", taskDir.getName());
Matcher taskMatcher = taskPattern.matcher(taskDir.getName());
taskMatcher.matches();
String taskDirName = taskDir.getName();
taskDirName = taskDirName.substring(taskDirName.lastIndexOf("-") + 1);
// Find the subtask directories and loop over them.
File[] subTaskDirs = taskDir
File[] subtaskDirs = taskDir
.listFiles((FileFilter) f -> f.getName().contains("st-") && f.isDirectory());
if (subTaskDirs != null) {
for (File subTaskDir : subTaskDirs) {
String subTaskId = taskDirName + "/" + subTaskDir.getName();
log.debug("processing subTaskId: " + subTaskId);
if (subtaskDirs != null) {
for (File subtaskDir : subtaskDirs) {
String subtaskId = taskDirName + "/" + subtaskDir.getName();
log.debug("Processing subtaskId {}", subtaskId);
// Get the contents of the PID filename and populate the map
File pidsFile = new File(subTaskDir,
File pidsFile = new File(subtaskDir,
MatlabJavaInitialization.MATLAB_PIDS_FILENAME);
if (pidsFile.exists()) {
String hostPid;
@ -358,7 +357,7 @@ public class Memdrone {
"Unable to read from file " + pidsFile.toString(), e);
}
pidToSubTask.put(hostPid, subTaskId);
pidToSubtask.put(hostPid, subtaskId);
}
}
}
@ -369,12 +368,12 @@ public class Memdrone {
ObjectOutputStream oos = new ObjectOutputStream(new BufferedOutputStream(
new FileOutputStream(cacheFile.toAbsolutePath().toString())))) {
oos.writeObject(pidToSubTask);
oos.writeObject(pidToSubtask);
oos.flush();
} catch (IOException e) {
throw new UncheckedIOException("Unable to write to file " + cacheFile.toString(), e);
}
return pidToSubTask;
return pidToSubtask;
}
private Date date() {


@ -37,18 +37,18 @@ public class MemdroneLog {
public MemdroneLog(File memdroneLogFile) {
if (!memdroneLogFile.exists()) {
throw new PipelineException(
"Specified memdrone file does not exist: " + memdroneLogFile);
"Specified memdrone file " + memdroneLogFile + " does not exist");
}
if (!memdroneLogFile.isFile()) {
throw new PipelineException(
"Specified memdrone file is not a regular file: " + memdroneLogFile);
"Specified memdrone file " + memdroneLogFile + " is not a regular file");
}
try {
input = new FileInputStream(memdroneLogFile);
} catch (FileNotFoundException e) {
throw new UncheckedIOException("File " + memdroneLogFile.toString() + " not found", e);
throw new UncheckedIOException("File " + memdroneLogFile + " not found", e);
}
this.memdroneLogFile = memdroneLogFile;
parse();
@ -61,7 +61,7 @@ public class MemdroneLog {
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
private void parse() {
log.info("Parse started");
log.info("Parsing");
logContents = new HashMap<>();
@ -87,9 +87,9 @@ public class MemdroneLog {
"IOException occurred reading from " + memdroneLogFile.toString(), e);
}
log.info("Parse complete");
log.info("lineCount: " + lineCount);
log.info("skipCount: " + skipCount);
log.info("Parsing...done");
log.info("lineCount = {}", lineCount);
log.info("skipCount = {}", skipCount);
}
public Map<String, DescriptiveStatistics> getLogContents() {


@ -52,7 +52,7 @@ public class MemdroneSample {
String[] elements = memdroneLogLine.split("\\s+");
if (elements.length != 9) {
log.warn("Parse error, num elements != 11 : " + memdroneLogLine);
log.warn("Parse error, {} elements != 9", memdroneLogLine);
return false;
}
String timestampString = elements[0] + " " + // day of week
@ -68,7 +68,7 @@ public class MemdroneSample {
timestampMillis = parseDate(timestampString);
memoryKilobytes = Integer.parseInt(elements[7]);
} catch (ParseException | NumberFormatException e) {
log.warn("Parse error: " + e);
log.warn("Parse error {}", e);
return false;
}
return true;


@ -16,8 +16,9 @@ import com.lowagie.text.pdf.PdfPTable;
import gov.nasa.ziggy.pipeline.definition.PipelineInstanceNode;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics.Units;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric.Units;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDataOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
public class NodeReport extends Report {
@ -27,6 +28,7 @@ public class NodeReport extends Report {
private Map<String, TopNList> categoryTopTen;
private Map<String, Units> categoryUnits;
private PipelineTaskOperations pipelineTaskOperations = new PipelineTaskOperations();
private PipelineTaskDataOperations pipelineTaskDataOperations = new PipelineTaskDataOperations();
public NodeReport(PdfRenderer pdfRenderer) {
super(pdfRenderer);
@ -44,11 +46,11 @@ public class NodeReport extends Report {
orderedCategoryNames = new LinkedList<>();
Map<PipelineTask, List<PipelineTaskMetrics>> taskMetricsByTask = pipelineTaskOperations()
Map<PipelineTask, List<PipelineTaskMetric>> taskMetricsByTask = pipelineTaskDataOperations()
.taskMetricsByTask(node);
for (PipelineTask task : taskMetricsByTask.keySet()) {
for (PipelineTaskMetrics taskMetric : taskMetricsByTask.get(task)) {
for (PipelineTaskMetric taskMetric : taskMetricsByTask.get(task)) {
String category = taskMetric.getCategory();
categoryUnits.put(category, taskMetric.getUnits());
@ -75,7 +77,7 @@ public class NodeReport extends Report {
categoryTopTen.put(category, topTen);
}
topTen.add(value, "ID: " + task.getId());
topTen.add(value, "ID: " + task);
List<PipelineTaskMetricValue> valueList = categoryMetrics.get(category);
@ -84,21 +86,21 @@ public class NodeReport extends Report {
categoryMetrics.put(category, valueList);
}
valueList.add(new PipelineTaskMetricValue(task.getId(), value));
valueList.add(new PipelineTaskMetricValue(task, value));
}
}
DefaultCategoryDataset categoryTaskDataset = new DefaultCategoryDataset();
log.info("summary report");
log.info("Summary report");
for (String category : orderedCategoryNames) {
log.info("processing category: " + category);
log.info("Processing category {}", category);
if (categoryIsTime(category)) {
List<PipelineTaskMetricValue> values = categoryMetrics.get(category);
for (PipelineTaskMetricValue value : values) {
Long taskId = value.getPipelineTaskId();
Long taskId = value.getPipelineTask().getId();
Long valueMillis = value.getMetricValue();
double valueMins = valueMillis / (1000.0 * 60);
categoryTaskDataset.addValue(valueMins, category, taskId);
@ -113,7 +115,7 @@ public class NodeReport extends Report {
pdfRenderer.newPage();
// task breakdown table
// Task breakdown table
pdfRenderer.printText("Wall Time Breakdown by Task and Category", PdfRenderer.h1Font);
pdfRenderer.println();
@ -179,6 +181,10 @@ public class NodeReport extends Report {
return pipelineTaskOperations;
}
PipelineTaskDataOperations pipelineTaskDataOperations() {
return pipelineTaskDataOperations;
}
/**
* Container for a {@link PipelineTask} and a metric value.
*
@ -186,16 +192,16 @@ public class NodeReport extends Report {
*/
private static class PipelineTaskMetricValue {
private final long pipelineTaskId;
private final PipelineTask pipelineTask;
private final long metricValue;
public PipelineTaskMetricValue(long pipelineTaskId, long metricValue) {
this.pipelineTaskId = pipelineTaskId;
public PipelineTaskMetricValue(PipelineTask pipelineTask, long metricValue) {
this.pipelineTask = pipelineTask;
this.metricValue = metricValue;
}
public long getPipelineTaskId() {
return pipelineTaskId;
public PipelineTask getPipelineTask() {
return pipelineTask;
}
public long getMetricValue() {

View File

@ -102,7 +102,7 @@ public class PerformanceReport {
"Unable to create directory " + outputPath.getParent().toString(), e);
}
log.info("Writing report to {}...", outputPath.toString());
log.info("Writing report to {}", outputPath.toString());
PdfRenderer pdfRenderer = new PdfRenderer(outputPath.toFile(), false);
@ -149,7 +149,7 @@ public class PerformanceReport {
throw new PipelineException("Invalid node range: " + nodes);
}
log.info("processing nodes " + startNode + " to " + endNode);
log.info("Processing nodes {} to {}", startNode, endNode);
return instanceNodes.subList(startNode, endNode + 1);
}
@ -175,14 +175,14 @@ public class PerformanceReport {
pdfRenderer.newPage();
log.info("category report");
log.info("Category report");
List<String> orderedCategoryNames = nodeReport.getOrderedCategoryNames();
Map<String, DescriptiveStatistics> categoryStats = nodeReport.getCategoryStats();
Map<String, TopNList> topTen = nodeReport.getCategoryTopTen();
for (String category : orderedCategoryNames) {
log.info("processing category: " + category);
log.info("Processing category {}", category);
boolean isTime = nodeReport.categoryIsTime(category);
CategoryReport categoryReport = new CategoryReport(category, pdfRenderer, isTime);

View File

@ -177,7 +177,7 @@ public abstract class Report {
protected void generateSummaryTable(String label, DescriptiveStatistics stats, TopNList topTen,
Format f) {
log.info("Generating report for: " + label);
log.info("Generating report for {}", label);
PdfPTable layoutTable = new PdfPTable(2);
PdfPTable statsTable = new PdfPTable(2);

View File

@ -82,10 +82,10 @@ public class ModelImporter {
*/
public void importModels(List<Path> files) {
log.info("Importing models...");
log.info("Importing models");
if (modelTypesToImport.isEmpty()) {
modelTypesToImport.addAll(modelTypes());
log.info("Retrieved " + modelTypesToImport.size() + " model types from database");
log.info("Retrieved {} model types from database", modelTypesToImport.size());
}
Map<ModelType, Map<String, Path>> modelFilesByModelType = new HashMap<>();
@ -111,7 +111,7 @@ public class ModelImporter {
long unlockedModelRegistryId = mergeRegistryAndReturnUnlockedId(modelRegistry);
log.info("Update of model registry complete");
log.info("Importing models...done");
log.info("Current unlocked model registry ID == " + unlockedModelRegistryId);
log.info("Current unlocked model registry ID is {}", unlockedModelRegistryId);
}
/**
@ -179,8 +179,8 @@ public class ModelImporter {
Set<String> modelVersions = new TreeSet<>(modelFilesByVersionId.keySet());
for (String version : modelVersions) {
createModel(modelRegistry, modelType, modelDir, modelFilesByVersionId.get(version));
log.info(modelFilesByVersionId.size() + " models of type " + modelType.getType()
+ " added to datastore");
log.info("Added {} models of type {} to datastore", modelFilesByVersionId.size(),
modelType.getType());
}
}
@ -209,7 +209,7 @@ public class ModelImporter {
currentModelRegistryMetadata);
modelMetadata.setDataReceiptTaskId(dataReceiptTaskId);
} catch (Exception e) {
log.error("Unable to create model metadata for file " + modelFile);
log.error("Unable to create model metadata for file {}", modelFile);
failedImports.add(dataImportPath.relativize(modelFile));
return;
}
@ -219,7 +219,7 @@ public class ModelImporter {
try {
move(modelFile, destinationFile);
} catch (Exception e) {
log.error("Unable to import file " + modelFile + " into datastore");
log.error("Unable to import file {} into datastore", modelFile);
failedImports.add(dataImportPath.relativize(modelFile));
return;
}
@ -227,8 +227,8 @@ public class ModelImporter {
// If all that worked, then we can update the model registry
persistModelMetadata(modelMetadata);
modelRegistry.updateModelMetadata(modelMetadata);
log.info("Imported file " + modelFile + " to models directory as "
+ modelMetadata.getDatastoreFileName() + " of type " + modelType.getType());
log.info("Imported file {} to models directory as {} of type {}", modelFile,
modelMetadata.getDatastoreFileName(), modelType.getType());
successfulImports.add(datastoreRoot.relativize(destinationFile));
}

View File

@ -1,5 +1,10 @@
package gov.nasa.ziggy.module;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.nio.file.Files;
@ -13,10 +18,13 @@ import gov.nasa.ziggy.module.remote.PbsParameters;
import gov.nasa.ziggy.module.remote.SupportedRemoteClusters;
import gov.nasa.ziggy.pipeline.definition.PipelineDefinitionNodeExecutionResources;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.TaskCounts.SubtaskCounts;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDataOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
import gov.nasa.ziggy.util.io.ZiggyFileUtils;;
/**
* Superclass for algorithm execution. This includes local execution via the
@ -32,25 +40,32 @@ public abstract class AlgorithmExecutor {
public static final String ZIGGY_PROGRAM = "ziggy";
protected static final String NODE_MASTER_NAME = "compute-node-master";
public static final String ACTIVE_CORES_FILE_NAME = ".activeCoresPerNode";
public static final String WALL_TIME_FILE_NAME = ".requestedWallTimeSeconds";
protected final PipelineTask pipelineTask;
private PipelineTaskOperations pipelineTaskOperations = new PipelineTaskOperations();
private PipelineTaskDataOperations pipelineTaskDataOperations = new PipelineTaskDataOperations();
private StateFile stateFile;
private PbsParameters pbsParameters;
/**
* Returns a new instance of the appropriate {@link AlgorithmExecutor} subclass.
*/
public static final AlgorithmExecutor newInstance(PipelineTask pipelineTask) {
return newInstance(pipelineTask, new PipelineTaskOperations());
return newInstance(pipelineTask, new PipelineTaskOperations(),
new PipelineTaskDataOperations());
}
/**
* Version of {@link #newInstance(PipelineTask)} that accepts user-supplied
* {@link PipelineTaskOperations} and {@link PipelineTaskDataOperations} instances. Allows these
* classes to be mocked for testing.
*
* @param pipelineTaskDataOperations the {@link PipelineTaskDataOperations} instance to use
*/
static final AlgorithmExecutor newInstance(PipelineTask pipelineTask,
PipelineTaskOperations pipelineTaskOperations) {
PipelineTaskOperations pipelineTaskOperations,
PipelineTaskDataOperations pipelineTaskDataOperations) {
if (pipelineTask == null) {
log.debug("Pipeline task is null, returning LocalAlgorithmExecutor instance");
@ -67,15 +82,15 @@ public abstract class AlgorithmExecutor {
log.debug("Remote execution not selected, returning LocalAlgorithmExecutor instance");
return new LocalAlgorithmExecutor(pipelineTask);
}
log.debug("Total subtasks " + pipelineTask.getTotalSubtaskCount());
log.debug("Completed subtasks " + pipelineTask.getCompletedSubtaskCount());
int subtasksToRun = pipelineTask.getTotalSubtaskCount()
- pipelineTask.getCompletedSubtaskCount();
SubtaskCounts subtaskCounts = pipelineTaskDataOperations.subtaskCounts(pipelineTask);
log.debug("Total subtasks {}", subtaskCounts.getTotalSubtaskCount());
log.debug("Completed subtasks {}", subtaskCounts.getCompletedSubtaskCount());
int subtasksToRun = subtaskCounts.getTotalSubtaskCount()
- subtaskCounts.getCompletedSubtaskCount();
if (subtasksToRun < remoteParams.getMinSubtasksForRemoteExecution()) {
log.info("Number subtasks to run (" + subtasksToRun
+ ") less than min subtasks for remote execution ("
+ remoteParams.getMinSubtasksForRemoteExecution() + ")");
log.info("Executing task " + pipelineTask.getId() + " locally");
log.info("Number subtasks to run ({}) less than min subtasks for remote execution ({})",
subtasksToRun, remoteParams.getMinSubtasksForRemoteExecution());
log.info("Executing task {} locally", pipelineTask);
return new LocalAlgorithmExecutor(pipelineTask);
}
return newRemoteInstance(pipelineTask);
@ -114,45 +129,44 @@ public abstract class AlgorithmExecutor {
public void submitAlgorithm(TaskConfiguration inputsHandler) {
prepareToSubmitAlgorithm(inputsHandler);
writeActiveCoresFile();
writeWallTimeFile();
IntervalMetric.measure(PipelineMetrics.SEND_METRIC, () -> {
log.info("Submitting task for execution (taskId=" + pipelineTask.getId() + ")");
log.info("Submitting task for execution (taskId={})", pipelineTask);
Files.createDirectories(algorithmLogDir());
Files.createDirectories(DirectoryProperties.stateFilesDir());
Files.createDirectories(taskDataDir());
SubtaskUtils.clearStaleAlgorithmStates(
new TaskDirectoryManager(pipelineTask).taskDir().toFile());
log.info("Start remote monitoring (taskId=" + pipelineTask.getId() + ")");
submitForExecution(stateFile);
log.info("Start remote monitoring (taskId={})", pipelineTask);
submitForExecution();
writeQueuedTimestampFile();
return null;
});
}
private void prepareToSubmitAlgorithm(TaskConfiguration inputsHandler) {
// execute the external process on a remote host
int numSubtasks;
PbsParameters pbsParameters = null;
PipelineDefinitionNodeExecutionResources executionResources = pipelineTaskOperations()
.executionResources(pipelineTask);
int numSubtasks;
// Initial submission: this is indicated by a non-null task configuration manager
if (inputsHandler != null) { // indicates initial submission
log.info("Processing initial submission of task " + pipelineTask.getId());
log.info("Processing initial submission of task {}", pipelineTask);
numSubtasks = inputsHandler.getSubtaskCount();
pipelineTaskDataOperations().updateSubtaskCounts(pipelineTask, numSubtasks, -1, -1);
pbsParameters = generatePbsParameters(executionResources, numSubtasks);
// Resubmission: this is indicated by a null task configuration manager, which
// means that subtask counts are available in the database
} else
{
log.info("Processing resubmission of task " + pipelineTask.getId());
numSubtasks = pipelineTask.getTotalSubtaskCount();
int numCompletedSubtasks = pipelineTask.getCompletedSubtaskCount();
} else {
log.info("Processing resubmission of task {}", pipelineTask);
SubtaskCounts subtaskCounts = pipelineTaskDataOperations().subtaskCounts(pipelineTask);
numSubtasks = subtaskCounts.getTotalSubtaskCount();
int numCompletedSubtasks = subtaskCounts.getCompletedSubtaskCount();
// Scale the total subtasks to get to the number that still need to be processed
double subtaskCountScaleFactor = (double) (numSubtasks - numCompletedSubtasks)
@ -162,23 +176,24 @@ public abstract class AlgorithmExecutor {
pbsParameters = generatePbsParameters(executionResources,
(int) (numSubtasks * subtaskCountScaleFactor));
}
stateFile = StateFile.generateStateFile(pipelineTask, pbsParameters, numSubtasks);
}
/**
* Resubmit the pipeline task to the appropriate {@link AlgorithmMonitor}. This is used in the
* case where the supervisor has been stopped and restarted but tasks are still running (usually
* remotely). This notifies the monitor that there are tasks that it should be looking out for.
*/
public void resumeMonitoring() {
prepareToSubmitAlgorithm(null);
addToMonitor(stateFile);
/** Writes the number of active cores per node to a file in the task directory. */
private void writeActiveCoresFile() {
writeActiveCoresFile(workingDir(), activeCores());
}
protected abstract void addToMonitor(StateFile stateFile);
private void writeWallTimeFile() {
writeWallTimeFile(workingDir(), wallTime());
}
protected abstract void submitForExecution(StateFile stateFile);
protected abstract void addToMonitor();
protected abstract void submitForExecution();
protected abstract String activeCores();
protected abstract String wallTime();
/**
* Generates an updated instance of {@link PbsParameters}. The method is abstract because each
@ -200,22 +215,47 @@ public abstract class AlgorithmExecutor {
return taskDataDir().resolve(pipelineTask.taskBaseName());
}
protected void writeQueuedTimestampFile() {
TimestampFile.create(workingDir().toFile(), TimestampFile.Event.QUEUED);
}
public abstract AlgorithmType algorithmType();
protected PipelineTaskOperations pipelineTaskOperations() {
return pipelineTaskOperations;
}
public StateFile getStateFile() {
return stateFile;
public PbsParameters getPbsParameters() {
return pbsParameters;
}
public enum AlgorithmType {
/** local execution only */
LOCAL,
protected PipelineTaskDataOperations pipelineTaskDataOperations() {
return pipelineTaskDataOperations;
}
/** Pleiades execution (database server is inside NAS enclave) */
REMOTE
// Broken out for use in ComputeNodeMaster unit tests.
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
static void writeActiveCoresFile(Path taskDir, String activeCoresPerNode) {
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(taskDir.resolve(ACTIVE_CORES_FILE_NAME).toFile()),
ZiggyFileUtils.ZIGGY_CHARSET))) {
writer.write(activeCoresPerNode);
writer.newLine();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
// Broken out for use in ComputeNodeMaster unit tests.
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
static void writeWallTimeFile(Path taskDir, String wallTime) {
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(taskDir.resolve(WALL_TIME_FILE_NAME).toFile()),
ZiggyFileUtils.ZIGGY_CHARSET))) {
writer.write(wallTime);
writer.newLine();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
}
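
With the StateFile gone, the executor now hands the per-node core count and requested wall time to the compute nodes through the `.activeCoresPerNode` and `.requestedWallTimeSeconds` marker files written above. Below is a minimal sketch of the same write-then-read round trip using plain `java.nio` calls and a temporary directory in place of a real task directory; only the file name constant is taken from the code above, everything else is illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MarkerFileRoundTrip {

    // Same file name as AlgorithmExecutor.ACTIVE_CORES_FILE_NAME above.
    private static final String ACTIVE_CORES_FILE_NAME = ".activeCoresPerNode";

    /** Writes a single value, followed by a newline, into the task directory. */
    static void writeActiveCoresFile(Path taskDir, String activeCoresPerNode) {
        try {
            Files.writeString(taskDir.resolve(ACTIVE_CORES_FILE_NAME),
                activeCoresPerNode + System.lineSeparator(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the value back and parses it, as the compute node does at startup. */
    static int activeCoresFromFile(Path taskDir) {
        try {
            return Integer.parseInt(
                Files.readString(taskDir.resolve(ACTIVE_CORES_FILE_NAME), StandardCharsets.UTF_8)
                    .trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path taskDir = Files.createTempDirectory("example-task-dir");
        writeActiveCoresFile(taskDir, "16");
        System.out.println("Active cores per node: " + activeCoresFromFile(taskDir));
    }
}
```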

View File

@ -1,8 +1,6 @@
package gov.nasa.ziggy.module;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import org.slf4j.Logger;
@ -55,17 +53,7 @@ public class AlgorithmLifecycleManager implements AlgorithmLifecycle {
@Override
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public File getTaskDir(boolean cleanExisting) {
File taskDir = allocateTaskDir(cleanExisting);
if (isRemote()) {
File stateFileLockFile = new File(taskDir, StateFile.LOCK_FILE_NAME);
try {
stateFileLockFile.createNewFile();
} catch (IOException e) {
throw new UncheckedIOException(
"Unable to create file " + stateFileLockFile.toString(), e);
}
}
return taskDir;
return allocateTaskDir(cleanExisting);
}
/*
@ -75,7 +63,7 @@ public class AlgorithmLifecycleManager implements AlgorithmLifecycle {
*/
@Override
public boolean isRemote() {
return executor.algorithmType() == AlgorithmExecutor.AlgorithmType.REMOTE;
return executor.algorithmType() == AlgorithmType.REMOTE;
}
@Override
@ -98,7 +86,7 @@ public class AlgorithmLifecycleManager implements AlgorithmLifecycle {
if (taskDir == null) {
taskDir = taskDirManager.allocateTaskDir(cleanExisting);
log.info("defaultWorkingDir = " + taskDir);
log.info("defaultWorkingDir={}", taskDir);
}
return taskDir.toFile();
}

View File

@ -1,33 +1,37 @@
package gov.nasa.ziggy.module;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import org.apache.commons.collections4.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.module.AlgorithmExecutor.AlgorithmType;
import gov.nasa.ziggy.module.remote.PbsLogParser;
import gov.nasa.ziggy.module.remote.QueueCommandManager;
import gov.nasa.ziggy.module.remote.RemoteJobInformation;
import gov.nasa.ziggy.pipeline.PipelineExecutor;
import gov.nasa.ziggy.pipeline.definition.PipelineDefinitionNodeExecutionResources;
import gov.nasa.ziggy.pipeline.definition.PipelineModule.RunMode;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.ProcessingStep;
import gov.nasa.ziggy.pipeline.definition.TaskCounts.SubtaskCounts;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDataOperations;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations;
import gov.nasa.ziggy.services.alert.Alert.Severity;
import gov.nasa.ziggy.services.alert.AlertService;
import gov.nasa.ziggy.services.alert.AlertService.Severity;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.services.messages.AllJobsFinishedMessage;
import gov.nasa.ziggy.services.messages.MonitorAlgorithmRequest;
import gov.nasa.ziggy.services.messages.WorkerStatusMessage;
import gov.nasa.ziggy.services.messages.TaskProcessingCompleteMessage;
import gov.nasa.ziggy.services.messaging.ZiggyMessenger;
import gov.nasa.ziggy.supervisor.PipelineSupervisor;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
@ -36,58 +40,61 @@ import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
/**
* Monitors algorithm processing by monitoring state files.
* <p>
* Two instances are used: one for local tasks and the other for remote execution jobs (on HPC
* and/or cloud systems). Each one checks at regular intervals for updates to the {@link StateFile}
* for the specific task. The intervals are managed by a {@link ScheduledThreadPoolExecutor}.
* Each task has an assigned {@link TaskMonitor} instance that periodically counts the states of
* subtasks to determine the overall progress of that task. In addition, there is a periodic check
* of PBS log files that allows the monitor to determine whether some or all of the remote jobs for
* a given task have failed.
*
* @author Todd Klaus
* @author PT
* @author Bill Wohler
*/
public class AlgorithmMonitor implements Runnable {
private static final Logger log = LoggerFactory.getLogger(AlgorithmMonitor.class);
private static final long SSH_POLL_INTERVAL_MILLIS = 10 * 1000; // 10 secs
private static final long REMOTE_POLL_INTERVAL_MILLIS = 10 * 1000; // 10 secs
private static final long LOCAL_POLL_INTERVAL_MILLIS = 2 * 1000; // 2 seconds
private static final long FINISHED_JOBS_POLL_INTERVAL_MILLIS = 10 * 1000;
private static AlgorithmMonitor localMonitoringInstance = null;
private static AlgorithmMonitor remoteMonitoringInstance = null;
private static ScheduledThreadPoolExecutor threadPool = new ScheduledThreadPoolExecutor(2);
private ScheduledThreadPoolExecutor threadPool = new ScheduledThreadPoolExecutor(1);
private List<String> corruptedStateFileNames = new ArrayList<>();
private boolean startLogMessageWritten = false;
private AlgorithmType algorithmType;
private String monitorVersion;
private PipelineTaskOperations pipelineTaskOperations = new PipelineTaskOperations();
private PipelineTaskDataOperations pipelineTaskDataOperations = new PipelineTaskDataOperations();
private PipelineExecutor pipelineExecutor = new PipelineExecutor();
private PbsLogParser pbsLogParser = new PbsLogParser();
private QueueCommandManager queueCommandManager = QueueCommandManager.newInstance();
JobMonitor jobMonitor = null;
private final Map<PipelineTask, List<RemoteJobInformation>> jobsInformationByTask = new ConcurrentHashMap<>();
private final Map<PipelineTask, TaskMonitor> taskMonitorByTask = new ConcurrentHashMap<>();
private AllJobsFinishedMessage allJobsFinishedMessage;
private final ConcurrentHashMap<String, StateFile> state = new ConcurrentHashMap<>();
// For testing only.
private Disposition disposition;
/** What needs to be done after a task exits the state file checks loop: */
private enum Disposition {
enum Disposition {
// Algorithm processing is complete. Persist results.
PERSIST {
@Override
public void performActions(AlgorithmMonitor monitor, PipelineTask pipelineTask) {
StateFile stateFile = new StateFile(pipelineTask.getModuleName(),
pipelineTask.getPipelineInstanceId(), pipelineTask.getId())
.newStateFileFromDiskFile();
if (stateFile.getNumFailed() != 0) {
SubtaskCounts subtaskCounts = monitor.pipelineTaskDataOperations()
.subtaskCounts(pipelineTask);
if (subtaskCounts.getFailedSubtaskCount() != 0) {
log.warn("{} subtasks out of {} failed but task completed",
stateFile.getNumFailed(), stateFile.getNumComplete());
subtaskCounts.getFailedSubtaskCount(),
subtaskCounts.getTotalSubtaskCount());
monitor.alertService()
.generateAndBroadcastAlert("Algorithm Monitor", pipelineTask.getId(),
.generateAndBroadcastAlert("Algorithm Monitor", pipelineTask,
Severity.WARNING, "Failed subtasks, see logs for details");
}
if (monitor.jobsInformationByTask.containsKey(pipelineTask)) {
monitor.jobsInformationByTask.remove(pipelineTask);
}
log.info("Sending task {} to worker to persist results", pipelineTask);
log.info("Sending task with id: " + pipelineTask.getId()
+ " to worker to persist results");
monitor.pipelineExecutor()
.persistTaskResults(
monitor.pipelineTaskOperations().pipelineTask(pipelineTask.getId()));
monitor.pipelineExecutor().persistTaskResults(pipelineTask);
}
},
@ -95,17 +102,19 @@ public class AlgorithmMonitor implements Runnable {
RESUBMIT {
@Override
public void performActions(AlgorithmMonitor monitor, PipelineTask pipelineTask) {
log.warn("Resubmitting task with id: " + pipelineTask.getId()
+ " for additional processing");
log.warn("Resubmitting task {} for additional processing", pipelineTask);
monitor.alertService()
.generateAndBroadcastAlert("Algorithm Monitor", pipelineTask.getId(),
Severity.WARNING, "Resubmitting task for further processing");
PipelineTask databaseTask = monitor.pipelineTaskOperations()
.prepareTaskForAutoResubmit(pipelineTask);
.generateAndBroadcastAlert("Algorithm Monitor", pipelineTask, Severity.WARNING,
"Resubmitting task for further processing");
monitor.pipelineTaskDataOperations().prepareTaskForAutoResubmit(pipelineTask);
if (monitor.jobsInformationByTask.containsKey(pipelineTask)) {
monitor.jobsInformationByTask.remove(pipelineTask);
}
// Submit tasks for resubmission at highest priority.
monitor.pipelineExecutor()
.restartFailedTasks(List.of(databaseTask), false, RunMode.RESUBMIT);
.restartFailedTasks(List.of(pipelineTask), false, RunMode.RESUBMIT);
}
},
@ -114,10 +123,15 @@ public class AlgorithmMonitor implements Runnable {
FAIL {
@Override
public void performActions(AlgorithmMonitor monitor, PipelineTask pipelineTask) {
log.error("Task with id " + pipelineTask.getId() + " failed on remote system, "
+ "marking task as errored and not restarting.");
monitor.handleFailedTask(monitor.state.get(StateFile.invariantPart(pipelineTask)));
monitor.pipelineTaskOperations().taskErrored(pipelineTask);
log.error(
"Task {} failed on remote system, marking task as errored and not restarting",
pipelineTask);
monitor.handleFailedTask(pipelineTask);
monitor.pipelineTaskDataOperations().taskErrored(pipelineTask);
if (monitor.jobsInformationByTask.containsKey(pipelineTask)) {
monitor.jobsInformationByTask.remove(pipelineTask);
}
}
};
@ -127,180 +141,146 @@ public class AlgorithmMonitor implements Runnable {
public abstract void performActions(AlgorithmMonitor monitor, PipelineTask pipelineTask);
}
public static void initialize() {
synchronized (AlgorithmMonitor.class) {
if (localMonitoringInstance == null || remoteMonitoringInstance == null) {
ZiggyMessenger.subscribe(MonitorAlgorithmRequest.class, message -> {
if (message.getAlgorithmType().equals(AlgorithmType.LOCAL)) {
AlgorithmMonitor.startLocalMonitoring(
message.getStateFile().newStateFileFromDiskFile());
} else {
AlgorithmMonitor.startRemoteMonitoring(
message.getStateFile().newStateFileFromDiskFile());
}
});
}
if (localMonitoringInstance == null) {
localMonitoringInstance = new AlgorithmMonitor(AlgorithmType.LOCAL);
localMonitoringInstance.startMonitoringThread();
}
if (remoteMonitoringInstance == null) {
remoteMonitoringInstance = new AlgorithmMonitor(AlgorithmType.REMOTE);
remoteMonitoringInstance.startMonitoringThread();
}
// Whenever a worker sends a "last message," run an unscheduled
// update of the monitors.
ZiggyMessenger.subscribe(WorkerStatusMessage.class, message -> {
if (message.isLastMessageFromWorker()) {
localMonitoringInstance.run();
remoteMonitoringInstance.run();
}
});
}
public AlgorithmMonitor() {
ZiggyMessenger.subscribe(MonitorAlgorithmRequest.class, message -> {
addToMonitor(message);
});
ZiggyMessenger.subscribe(TaskProcessingCompleteMessage.class, message -> {
endTaskMonitoring(message.getPipelineTask());
});
startMonitoringThread();
}
/**
* Returns the collection of {@link StateFile} instances currently being tracked by the remote
* execution monitor.
* Start the monitoring thread.
*/
public static Collection<StateFile> remoteTaskStateFiles() {
if (remoteMonitoringInstance == null || remoteMonitoringInstance.state.isEmpty()) {
return null;
}
return remoteMonitoringInstance.state.values();
}
/**
* Constructor. Default scope for use in unit tests.
*/
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
AlgorithmMonitor(AlgorithmType algorithmType) {
this.algorithmType = algorithmType;
monitorVersion = algorithmType.name().toLowerCase();
log.info("Starting new monitor for: " + DirectoryProperties.stateFilesDir().toString());
initializeJobMonitor();
}
/**
* Start the monitoring thread for a given monitor.
*/
void startMonitoringThread() {
long pollingIntervalMillis = pollingIntervalMillis();
private void startMonitoringThread() {
long pollingIntervalMillis = finishedJobsPollingIntervalMillis();
if (pollingIntervalMillis > 0) {
log.info("Starting polling on " + monitorVersion + " with " + pollingIntervalMillis
+ " msec interval");
threadPool.scheduleWithFixedDelay(this, 0, pollingIntervalMillis(),
log.info("Starting polling with {} msec interval", REMOTE_POLL_INTERVAL_MILLIS);
threadPool.scheduleWithFixedDelay(this, 0, REMOTE_POLL_INTERVAL_MILLIS,
TimeUnit.MILLISECONDS);
}
}
private String username() {
String username = System.getenv("USERNAME");
if (username == null) {
username = System.getenv("USER");
// Protected access for unit tests.
protected final void addToMonitor(MonitorAlgorithmRequest request) {
List<RemoteJobInformation> remoteJobsInformation = request.getRemoteJobsInformation();
log.info("Starting algorithm monitoring for task {}", request.getPipelineTask());
if (!CollectionUtils.isEmpty(remoteJobsInformation)) {
jobsInformationByTask.put(request.getPipelineTask(), remoteJobsInformation);
}
return username;
TaskMonitor taskMonitor = taskMonitor(request);
taskMonitorByTask.put(request.getPipelineTask(), taskMonitor);
taskMonitor.startMonitoring();
}
public static void startLocalMonitoring(StateFile task) {
startMonitoring(task, AlgorithmType.LOCAL);
// Actions to be taken when a task monitor reports that a task is done.
// Protected access for unit tests.
protected final void endTaskMonitoring(PipelineTask pipelineTask) {
// When the worker that persists the results exits, it will cause
// this method to execute even though the algorithm is no longer under
// monitoring. In that circumstance, exit now.
if (!taskMonitorByTask.containsKey(pipelineTask)) {
return;
}
log.info("End monitoring for task {}", pipelineTask);
// update processing state
pipelineTaskDataOperations().updateProcessingStep(pipelineTask,
ProcessingStep.WAITING_TO_STORE);
// It may be the case that all the subtasks are processed, but that
// there are jobs still running, or (more likely) queued. We can address
// that by deleting them from PBS.
List<Long> jobIds = remoteJobIds(
incompleteRemoteJobs(jobsInformationByTask.get(pipelineTask)));
if (!CollectionUtils.isEmpty(jobIds)) {
queueCommandManager().deleteJobsByJobId(jobIds);
}
if (jobsInformationByTask.containsKey(pipelineTask)) {
updateRemoteJobs(pipelineTask);
}
taskMonitorByTask.remove(pipelineTask);
// Figure out what needs to happen next, and do it.
determineDisposition(pipelineTask).performActions(this, pipelineTask);
}
public static void startRemoteMonitoring(StateFile task) {
startMonitoring(task, AlgorithmType.REMOTE);
private void updateRemoteJobs(PipelineTask pipelineTask) {
pipelineTaskDataOperations().updateJobs(pipelineTask, true);
}
private static void startMonitoring(StateFile task, AlgorithmType algorithmType) {
AlgorithmMonitor instance = algorithmType.equals(AlgorithmType.REMOTE)
? remoteMonitoringInstance
: localMonitoringInstance;
instance.startMonitoring(task);
}
void startMonitoring(StateFile task) {
log.info("Starting monitoring for: " + task.invariantPart() + " on " + monitorVersion
+ " algorithm monitor");
state.put(task.invariantPart(), new StateFile(task));
jobMonitor().addToMonitoring(task);
}
private List<File> stateFiles() {
List<File> stateDirListing = new LinkedList<>();
// get the raw list, excluding directories
File stateDirFile = DirectoryProperties.stateFilesDir().toFile();
List<File> allFiles = new ArrayList<>();
if (stateDirFile != null) {
File[] files = stateDirFile.listFiles();
if (files != null) {
allFiles.addAll(Arrays.asList(files));
/**
* Returns the collection of remote job IDs that correspond to a given collection of pipeline
* tasks, as a {@link Map}.
*/
public Map<PipelineTask, List<Long>> jobIdsByTaskId(Collection<PipelineTask> pipelineTasks) {
Map<PipelineTask, List<Long>> jobIdsByTaskId = new HashMap<>();
if (jobsInformationByTask.size() == 0) {
return jobIdsByTaskId;
}
for (PipelineTask pipelineTask : pipelineTasks) {
List<RemoteJobInformation> remoteJobsInformation = jobsInformationByTask
.get(pipelineTask);
if (remoteJobsInformation != null) {
jobIdsByTaskId.put(pipelineTask,
remoteJobIds(incompleteRemoteJobs(remoteJobsInformation)));
}
stateDirListing = allFiles.stream()
.filter(s -> !s.isDirectory())
.collect(Collectors.toList());
}
return jobIdsByTaskId;
}
// throw away everything that doesn't have the correct pattern, and everything
// that is on the corrupted list
List<File> filteredFiles = stateDirListing.stream()
.filter(s -> StateFile.STATE_FILE_NAME_PATTERN.matcher(s.getName()).matches())
.filter(s -> !corruptedStateFileNames.contains(s.getName()))
private List<RemoteJobInformation> incompleteRemoteJobs(
Collection<RemoteJobInformation> remoteJobsInformation) {
List<RemoteJobInformation> incompleteRemoteJobs = new ArrayList<>();
if (CollectionUtils.isEmpty(remoteJobsInformation)) {
return incompleteRemoteJobs;
}
for (RemoteJobInformation remoteJobInformation : remoteJobsInformation) {
if (!Files.exists(Paths.get(remoteJobInformation.getLogFile()))) {
incompleteRemoteJobs.add(remoteJobInformation);
}
}
return incompleteRemoteJobs;
}
private List<Long> remoteJobIds(Collection<RemoteJobInformation> remoteJobsInformation) {
return remoteJobsInformation.stream()
.map(RemoteJobInformation::getJobId)
.collect(Collectors.toList());
return new LinkedList<>(filteredFiles);
}
private void performStateFileChecks(StateFile oldState, StateFile remoteState) {
private void checkForFinishedJobs() {
String key = remoteState.invariantPart();
if (!oldState.equals(remoteState)) {
// state change
log.info("Updating state for: " + remoteState + " (was: " + oldState + ")");
state.put(key, remoteState);
long taskId = remoteState.getPipelineTaskId();
pipelineTaskOperations().updateSubtaskCounts(taskId, remoteState.getNumTotal(),
remoteState.getNumComplete(), remoteState.getNumFailed());
if (remoteState.isRunning()) {
// update processing state
pipelineTaskOperations().updateProcessingStep(taskId, ProcessingStep.EXECUTING);
for (PipelineTask pipelineTask : jobsInformationByTask.keySet()) {
if (isFinished(pipelineTask)) {
publishFinishedJobsMessage(pipelineTask);
}
if (remoteState.isDone()) {
// update processing state
pipelineTaskOperations().updateProcessingStep(taskId,
ProcessingStep.WAITING_TO_STORE);
// It may be the case that all the subtasks are processed, but that
// there are jobs still running, or (more likely) queued. We can address
// that by deleting them from PBS.
Set<Long> jobIds = jobMonitor().allIncompleteJobIds(remoteState);
if (jobIds != null && !jobIds.isEmpty()) {
jobMonitor().getQstatCommandManager().deleteJobsByJobId(jobIds);
}
// Always send the task back to the worker
sendTaskToWorker(remoteState);
log.info("Removing monitoring for: " + key);
state.remove(key);
jobMonitor().endMonitoring(remoteState);
}
} else if (jobMonitor().isFinished(remoteState)) {
// Some job failures leave the state file untouched. If this happens, the QstatMonitor
// can determine that in fact the job is no longer running. In this case, set the state
// file to FAILED. Then in the next pass through this loop, standard handling for a
// failed job can be applied.
moveStateFileToCompleteState(remoteState);
}
}
private void publishFinishedJobsMessage(PipelineTask pipelineTask) {
allJobsFinishedMessage = new AllJobsFinishedMessage(pipelineTask);
ZiggyMessenger.publish(allJobsFinishedMessage, false);
}
private boolean isFinished(PipelineTask pipelineTask) {
List<RemoteJobInformation> remoteJobsInformation = jobsInformationByTask.get(pipelineTask);
if (CollectionUtils.isEmpty(remoteJobsInformation)) {
return false;
}
for (RemoteJobInformation remoteJobInformation : remoteJobsInformation) {
if (!Files.exists(Paths.get(remoteJobInformation.getLogFile()))) {
return false;
}
}
return true;
}
@Override
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
public void run() {
@ -310,118 +290,51 @@ public class AlgorithmMonitor implements Runnable {
startLogMessageWritten = true;
}
try {
if (!state.isEmpty()) {
jobMonitor().update();
List<File> stateDirListing = stateFiles();
if (log.isDebugEnabled()) {
dumpRemoteState(stateDirListing);
}
performStateFileLoop(stateDirListing);
}
checkForFinishedJobs();
} catch (Exception e) {
// We don't want transient problems with the remote monitoring tool
// (which is third party software and not under our control) to bring
// down the monitor, so we catch all exceptions here including runtime
// ones in the hope and expectation that the next time we call the
// monitor the transient problem will have resolved itself.
log.warn("Task monitor: exception has occurred", e);
}
}
/**
* Loops over all files in the stateDirListing and, if they are in the monitoring list, performs
* state file checks on the cached and file states. Any state file name that cannot be parsed
* into a new StateFile object is added to a registry of corrupted names and subsequently
* ignored.
*
* @param stateDirListing
*/
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
private void performStateFileLoop(List<File> stateDirListing) {
for (File remoteFile : stateDirListing) {
String name = remoteFile.getName();
try {
StateFile remoteState = new StateFile(name);
StateFile oldState = state.get(remoteState.invariantPart());
if (oldState != null) { // ignore tasks we were not
// charged with
performStateFileChecks(oldState, remoteState);
}
} catch (Exception e) {
log.error("State file with name " + name
+ " encountered exception and will be removed from monitoring", e);
corruptedStateFileNames.add(name);
}
}
}
/**
* Resubmits a task to the worker and, optionally, to the NAS. This method is called both for
* complete tasks and failing tasks, because each needs to be looked at again by the worker. If
* the task has failing subtasks which should be resubmitted, the caller should specify
* <code>true</code> for <code>restart</code>.
*
* @param remoteState the state file
* @param restart if true, resubmit the task in the NAS
*/
private void sendTaskToWorker(StateFile remoteState) {
PipelineTask pipelineTask = pipelineTaskOperations()
.pipelineTask(remoteState.getPipelineTaskId());
pipelineTaskOperations().updateJobs(pipelineTask, true);
// Perform the actions necessary based on the task disposition
determineDisposition(remoteState, pipelineTask).performActions(this, pipelineTask);
}
/**
* Handles a task for which processing has failed (either due to error or because the user
* killed it). In this case, several actions need to be taken: the information about the cause
* of the error has to be captured via qstat and logged locally; the pipeline task entry in the
* database needs its TaskExecutionLog updated; the remote state file needs to be renamed to
* indicate that the job errored.
*
* @param stateFile StateFile instance for deleted task.
*/
private void handleFailedTask(StateFile stateFile) {
private void handleFailedTask(PipelineTask pipelineTask) {
// get the exit code and comment via qstat
String exitStatus = taskStatusValues(stateFile);
String exitComment = taskCommentValues(stateFile);
String exitState = stateFile.getState().toString().toLowerCase();
// Get the exit code and comment.
String exitStatus = taskStatusValues(pipelineTask);
String exitComment = taskCommentValues(pipelineTask);
if (exitState.equals("deleted")) {
log.error("Task " + stateFile.getPipelineTaskId() + " has state file in "
+ exitState.toUpperCase() + " state");
} else {
log.error("Task " + stateFile.getPipelineTaskId() + " has failed");
exitState = "failed";
}
log.error("Task {} has failed", pipelineTask);
if (exitStatus != null) {
log.error("Exit status from remote system for all jobs: " + exitStatus);
log.error("Exit status from remote system for all jobs is {}", exitStatus);
} else {
log.error("No exit status provided");
exitStatus = "not provided";
}
if (exitComment != null) {
log.error("Exit comment from remote system: " + exitComment);
log.error("Exit comment from remote system is {}", exitComment);
} else {
log.error("No exit comment provided");
exitComment = "not provided";
}
// issue an alert about the deletion
String message = algorithmType.equals(AlgorithmType.REMOTE)
? "Task " + exitState + ", return codes = " + exitStatus + ", comments = " + exitComment
: "Task " + exitState;
alertService().generateAndBroadcastAlert("Algorithm Monitor", stateFile.getPipelineTaskId(),
AlertService.Severity.ERROR, message);
String message = pipelineTaskDataOperations()
.algorithmType(pipelineTask) == AlgorithmType.REMOTE
? "Task failed, return codes = " + exitStatus + ", comments = " + exitComment
: "Task failed";
alertService().generateAndBroadcastAlert("Algorithm Monitor", pipelineTask, Severity.ERROR,
message);
}
private String taskStatusValues(StateFile stateFile) {
Map<Long, Integer> exitStatusByJobId = jobMonitor().exitStatus(stateFile);
private String taskStatusValues(PipelineTask pipelineTask) {
Map<Long, Integer> exitStatusByJobId = pbsLogParser()
.exitStatusByJobId(jobsInformationByTask.get(pipelineTask));
if (exitStatusByJobId.isEmpty()) {
return null;
}
@ -438,13 +351,14 @@ public class AlgorithmMonitor implements Runnable {
return sb.toString();
}
private String taskCommentValues(StateFile stateFile) {
Map<Long, String> exitComment = jobMonitor().exitComment(stateFile);
if (exitComment == null || exitComment.size() == 0) {
private String taskCommentValues(PipelineTask pipelineTask) {
Map<Long, String> exitCommentByJobId = pbsLogParser()
.exitCommentByJobId(jobsInformationByTask.get(pipelineTask));
if (exitCommentByJobId == null || exitCommentByJobId.size() == 0) {
return null;
}
StringBuilder sb = new StringBuilder();
for (Map.Entry<Long, String> entry : exitComment.entrySet()) {
for (Map.Entry<Long, String> entry : exitCommentByJobId.entrySet()) {
sb.append(entry.getKey());
sb.append("(");
if (entry.getValue() != null) {
@ -456,62 +370,47 @@ public class AlgorithmMonitor implements Runnable {
return sb.toString();
}
/**
* Moves the state file for a remote job to the COMPLETE state. This is only done if the job
* ended on the remote system in a way that was not detected by the job itself but was later
* detected via qstat calls.
*
* @param stateFile State file for failed job.
*/
private void moveStateFileToCompleteState(StateFile stateFile) {
stateFile.setStateAndPersist(StateFile.State.COMPLETE);
}
private Disposition determineDisposition(StateFile state, PipelineTask pipelineTask) {
private Disposition determineDisposition(PipelineTask pipelineTask) {
// A task that was deliberately killed must be marked as failed regardless of
// how many subtasks completed.
if (taskIsKilled(pipelineTask.getId())) {
if (taskIsKilled(pipelineTask)) {
log.debug("Task {} was halted", pipelineTask.getId());
disposition = Disposition.FAIL;
return Disposition.FAIL;
}
// The total number of bad subtasks includes both the ones that failed and the
// ones that never ran / never finished. If there are few enough bad subtasks,
// then we can persist results.
PipelineDefinitionNodeExecutionResources resources = pipelineTaskOperations()
.executionResources(pipelineTask);
if (state.getNumTotal() - state.getNumComplete() <= resources.getMaxFailedSubtaskCount()) {
SubtaskCounts subtaskCounts = pipelineTaskDataOperations().subtaskCounts(pipelineTask);
log.debug("Number of subtasks for task {}: {}", pipelineTask.getId(),
subtaskCounts.getTotalSubtaskCount());
log.debug("Number of completed subtasks for task {}: {}", pipelineTask.getId(),
subtaskCounts.getCompletedSubtaskCount());
log.debug("Number of failed subtasks for task {}: {}", pipelineTask.getId(),
subtaskCounts.getFailedSubtaskCount());
if (subtaskCounts.getTotalSubtaskCount()
- subtaskCounts.getCompletedSubtaskCount() <= resources.getMaxFailedSubtaskCount()) {
disposition = Disposition.PERSIST;
return Disposition.PERSIST;
}
// If the task has bad subtasks but the number of automatic resubmits hasn't
// been exhausted, then resubmit.
if (pipelineTask.getAutoResubmitCount() < resources.getMaxAutoResubmits()) {
if (pipelineTaskDataOperations().autoResubmitCount(pipelineTask) < resources
.getMaxAutoResubmits()) {
disposition = Disposition.RESUBMIT;
return Disposition.RESUBMIT;
}
// If we've gotten this far, then the task has to be considered as failed:
// it has too many bad subtasks and has exhausted its automatic retries.
disposition = Disposition.FAIL;
return Disposition.FAIL;
}
private void dumpRemoteState(List<File> remoteState) {
log.debug("Remote state dir:");
for (File file : remoteState) {
log.debug(file.toString());
}
}
/**
* Determines whether to continue the monitoring while-loop. For testing purposes, this can be
* replaced with a mocked version that performs a finite number of loops.
*
* @return true
*/
boolean continueMonitoring() {
return true;
}
/**
* Obtains a new PipelineExecutor. Replace with mocked method for unit testing.
*
@ -527,41 +426,63 @@ public class AlgorithmMonitor implements Runnable {
}
/** Replace with mocked method for unit testing. */
boolean taskIsKilled(long taskId) {
return PipelineSupervisor.taskOnKilledTaskList(taskId);
boolean taskIsKilled(PipelineTask pipelineTask) {
return PipelineSupervisor.taskOnHaltedTaskList(pipelineTask);
}
/**
* Returns the polling interval, in milliseconds. Replace with mocked method for unit testing.
*/
long pollingIntervalMillis() {
return algorithmType.equals(AlgorithmType.REMOTE) ? SSH_POLL_INTERVAL_MILLIS
: LOCAL_POLL_INTERVAL_MILLIS;
long finishedJobsPollingIntervalMillis() {
return FINISHED_JOBS_POLL_INTERVAL_MILLIS;
}
/** Stops the thread pool and replaces it. For testing only. */
static void resetThreadPool() {
if (threadPool != null) {
threadPool.shutdownNow();
}
threadPool = new ScheduledThreadPoolExecutor(2);
long remotePollIntervalMillis() {
return REMOTE_POLL_INTERVAL_MILLIS;
}
/** Gets the {@link StateFile} from the state {@link Map}. For testing only. */
StateFile getStateFile(StateFile stateFile) {
return state.get(stateFile.invariantPart());
long localPollIntervalMillis() {
return LOCAL_POLL_INTERVAL_MILLIS;
}
private void initializeJobMonitor() {
jobMonitor = JobMonitor.newInstance(username(), algorithmType);
}
/** Replace with mocked method for unit testing. */
JobMonitor jobMonitor() {
return jobMonitor;
TaskMonitor taskMonitor(MonitorAlgorithmRequest monitorAlgorithmRequest) {
return new TaskMonitor(monitorAlgorithmRequest.getPipelineTask(),
monitorAlgorithmRequest.getTaskDir().toFile(),
CollectionUtils.isEmpty(monitorAlgorithmRequest.getRemoteJobsInformation())
? localPollIntervalMillis()
: remotePollIntervalMillis());
}
PipelineTaskOperations pipelineTaskOperations() {
return pipelineTaskOperations;
}
PipelineTaskDataOperations pipelineTaskDataOperations() {
return pipelineTaskDataOperations;
}
PbsLogParser pbsLogParser() {
return pbsLogParser;
}
QueueCommandManager queueCommandManager() {
return queueCommandManager;
}
// For testing only.
Map<PipelineTask, TaskMonitor> getTaskMonitorByTask() {
return taskMonitorByTask;
}
// For testing only.
Map<PipelineTask, List<RemoteJobInformation>> getJobsInformationByTask() {
return jobsInformationByTask;
}
// For testing only.
Disposition getDisposition() {
return disposition;
}
// For testing only.
AllJobsFinishedMessage allJobsFinishedMessage() {
return allJobsFinishedMessage;
}
}
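
The disposition logic in determineDisposition() reduces to a three-way decision on the subtask counts and the auto-resubmit budget. Here is a standalone sketch of that decision, with plain int parameters standing in for the PipelineTaskDataOperations and PipelineDefinitionNodeExecutionResources lookups; the numbers in main() are hypothetical.

```java
enum Disposition { PERSIST, RESUBMIT, FAIL }

public class DispositionSketch {

    /**
     * Mirrors AlgorithmMonitor.determineDisposition(): halted tasks fail outright; otherwise
     * persist if the unfinished subtasks fit within the failure budget; otherwise resubmit while
     * automatic resubmits remain; otherwise fail.
     */
    static Disposition determineDisposition(boolean halted, int totalSubtasks,
        int completedSubtasks, int maxFailedSubtasks, int autoResubmitCount,
        int maxAutoResubmits) {
        if (halted) {
            return Disposition.FAIL;
        }
        if (totalSubtasks - completedSubtasks <= maxFailedSubtasks) {
            return Disposition.PERSIST;
        }
        if (autoResubmitCount < maxAutoResubmits) {
            return Disposition.RESUBMIT;
        }
        return Disposition.FAIL;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 100 subtasks, failure budget of 5, one automatic resubmit allowed.
        System.out.println(determineDisposition(false, 100, 90, 5, 0, 1)); // RESUBMIT
        System.out.println(determineDisposition(false, 100, 97, 5, 0, 1)); // PERSIST
        System.out.println(determineDisposition(false, 100, 90, 5, 1, 1)); // FAIL
    }
}
```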

View File

@ -12,7 +12,7 @@ import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
/**
* This class manages zero-length files whose names are used to represent the state of an executing
* algorithm. These files are stored in the subtask working directory.
* algorithm. These files are stored in the task and subtask working directories.
*
* @author Todd Klaus
* @author PT
@ -22,7 +22,7 @@ public class AlgorithmStateFiles {
private static final String HAS_OUTPUTS = "HAS_OUTPUTS";
public enum SubtaskState {
public enum AlgorithmState {
// State in which no AlgorithmStateFile is present. Rather than return an actual
// null, when queried about the algorithm state we can return AlgorithmState.NULL.
NULL {
@ -59,9 +59,9 @@ public class AlgorithmStateFiles {
private final File outputsFlag;
public AlgorithmStateFiles(File workingDir) {
processingFlag = new File(workingDir, "." + SubtaskState.PROCESSING.toString());
completeFlag = new File(workingDir, "." + SubtaskState.COMPLETE.toString());
failedFlag = new File(workingDir, "." + SubtaskState.FAILED.toString());
processingFlag = new File(workingDir, "." + AlgorithmState.PROCESSING.toString());
completeFlag = new File(workingDir, "." + AlgorithmState.COMPLETE.toString());
failedFlag = new File(workingDir, "." + AlgorithmState.FAILED.toString());
outputsFlag = new File(workingDir, "." + HAS_OUTPUTS);
}
@ -81,7 +81,7 @@ public class AlgorithmStateFiles {
/**
* Removes any "stale" state flags. A stale state flag is one from a previous processing attempt
* that will cause the pipeline to either miscount the task/ sub-task, or do the wrong thing
* that will cause the pipeline to either miscount the task/ subtask, or do the wrong thing
* with it. COMPLETED states are never stale, because they finished and don't need to be
* restarted. FAILED and PROCESSING flags are stale, because they indicate incomplete prior
* execution but can prevent the current execution attempt from starting.
@ -91,7 +91,7 @@ public class AlgorithmStateFiles {
* the preceding run.
*/
public void clearStaleState() {
if (!currentSubtaskState().equals(SubtaskState.COMPLETE)) {
if (!currentAlgorithmState().equals(AlgorithmState.COMPLETE)) {
outputsFlag.delete();
}
processingFlag.delete();
@ -99,7 +99,7 @@ public class AlgorithmStateFiles {
}
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public void updateCurrentState(SubtaskState newState) {
public void updateCurrentState(AlgorithmState newState) {
clearState();
try {
@ -135,27 +135,27 @@ public class AlgorithmStateFiles {
}
}
public SubtaskState currentSubtaskState() {
SubtaskState current = SubtaskState.NULL;
public AlgorithmState currentAlgorithmState() {
AlgorithmState current = AlgorithmState.NULL;
if (processingFlag.exists()) {
current = SubtaskState.PROCESSING;
current = AlgorithmState.PROCESSING;
}
if (completeFlag.exists()) {
if (current != SubtaskState.NULL) {
if (current != AlgorithmState.NULL) {
log.warn("Duplicate algorithm state files found!");
return null;
}
current = SubtaskState.COMPLETE;
current = AlgorithmState.COMPLETE;
}
if (failedFlag.exists()) {
if (current != SubtaskState.NULL) {
if (current != AlgorithmState.NULL) {
log.warn("Duplicate algorithm state files found!");
return null;
}
current = SubtaskState.FAILED;
current = AlgorithmState.FAILED;
}
log.debug("current state: " + current);
log.debug("current={}", current);
return current;
}
@ -164,20 +164,20 @@ public class AlgorithmStateFiles {
*
* @return
*/
public boolean subtaskStateExists() {
public boolean stateExists() {
return completeFlag.exists() || processingFlag.exists() || failedFlag.exists();
}
public boolean isProcessing() {
return currentSubtaskState() == SubtaskState.PROCESSING;
return currentAlgorithmState() == AlgorithmState.PROCESSING;
}
public boolean isComplete() {
return currentSubtaskState() == SubtaskState.COMPLETE;
return currentAlgorithmState() == AlgorithmState.COMPLETE;
}
public boolean isFailed() {
return currentSubtaskState() == SubtaskState.FAILED;
return currentAlgorithmState() == AlgorithmState.FAILED;
}
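
A short usage sketch of the renamed marker-file API follows, assuming a scratch directory stands in for a real task or subtask working directory; the method names and enum values come from the class above, and the behavior noted in the comments follows its Javadoc.

```java
import java.io.File;
import java.nio.file.Files;

import gov.nasa.ziggy.module.AlgorithmStateFiles;
import gov.nasa.ziggy.module.AlgorithmStateFiles.AlgorithmState;

public class AlgorithmStateFilesExample {

    public static void main(String[] args) throws Exception {
        // A scratch directory stands in for a real task or subtask working directory.
        File workingDir = Files.createTempDirectory("algorithm-state-example").toFile();
        AlgorithmStateFiles stateFiles = new AlgorithmStateFiles(workingDir);

        // Before any marker file exists, the state is AlgorithmState.NULL.
        System.out.println("Initial state: " + stateFiles.currentAlgorithmState());

        // Mark the directory as processing, then as complete.
        stateFiles.updateCurrentState(AlgorithmState.PROCESSING);
        System.out.println("Processing? " + stateFiles.isProcessing());

        stateFiles.updateCurrentState(AlgorithmState.COMPLETE);
        System.out.println("Complete? " + stateFiles.isComplete());

        // Per the Javadoc, clearStaleState() removes stale PROCESSING and FAILED markers from a
        // prior attempt but leaves COMPLETE markers (and their outputs flag) alone.
        stateFiles.clearStaleState();
        System.out.println("State after clearing stale markers: "
            + stateFiles.currentAlgorithmState());
    }
}
```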
/**

View File

@ -0,0 +1,9 @@
package gov.nasa.ziggy.module;
public enum AlgorithmType {
/** Local execution only. */
LOCAL,
/** Execution occurs on another host or system. */
REMOTE
}

View File

@ -34,34 +34,33 @@
package gov.nasa.ziggy.module;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.util.concurrent.ThreadFactoryBuilder;
import gov.nasa.ziggy.module.StateFile.State;
import gov.nasa.ziggy.module.remote.TimestampFile;
import gov.nasa.ziggy.module.AlgorithmStateFiles.AlgorithmState;
import gov.nasa.ziggy.services.config.PropertyName;
import gov.nasa.ziggy.services.config.ZiggyConfiguration;
import gov.nasa.ziggy.services.logging.TaskLog;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
import gov.nasa.ziggy.util.BuildInfo;
import gov.nasa.ziggy.util.HostNameUtils;
import gov.nasa.ziggy.util.TimeFormatter;
import gov.nasa.ziggy.util.io.LockManager;
import gov.nasa.ziggy.util.io.ZiggyFileUtils;
/**
* Acts as a controller for a single-node remote job and associated subtasks running on the node.
@ -78,24 +77,18 @@ import gov.nasa.ziggy.util.io.LockManager;
* @author Todd Klaus
* @author PT
*/
public class ComputeNodeMaster implements Runnable {
public class ComputeNodeMaster {
private static final Logger log = LoggerFactory.getLogger(ComputeNodeMaster.class);
private static final long SLEEP_INTERVAL_MILLIS = 10000;
private final String workingDir;
private int coresPerNode;
private final StateFile stateFile;
private final File taskDir;
private final File stateFileLockFile;
private String nodeName = "<unknown>";
private TaskMonitor monitor;
private SubtaskServer subtaskServer;
private Semaphore subtaskMasterSemaphore;
private CountDownLatch monitoringLatch = new CountDownLatch(1);
private CountDownLatch subtaskMasterCountdownLatch;
private ExecutorService threadPool;
private Set<SubtaskMaster> subtaskMasters = new HashSet<>();
@ -106,209 +99,107 @@ public class ComputeNodeMaster implements Runnable {
this.workingDir = workingDir;
log.info("RemoteTaskMaster START");
log.info(" workingDir = " + workingDir);
log.info(" workingDir = {}", workingDir);
// Tell anyone who cares that this task is no longer queued.
new AlgorithmStateFiles(new File(workingDir)).updateCurrentState(AlgorithmState.PROCESSING);
nodeName = HostNameUtils.shortHostName();
stateFile = StateFile.of(Paths.get(workingDir)).newStateFileFromDiskFile();
taskDir = new File(workingDir);
stateFileLockFile = stateFile.lockFile();
}
/**
* Initializes the {@link ComputeNodeMaster}. Specifically, it locates the file that carries the
* node name of the node with the {@link SubtaskServer} and starts new threads for the
* {@link SubtaskMaster} instances. For the node that is going to host the {@link SubtaskServer}
* instance it also starts the server, updates the {@link StateFile}, creates symlinks, and
* creates task-start timestamps.
* <p>
* The {@link #initialize()} method returns a boolean that indicates whether monitoring is
* required. This returns true if there are unprocessed subtasks and false if all subtasks are
* actually processed.
* Initializes the {@link ComputeNodeMaster}. A number of timestamp files are created if needed
* in the task directory; a {@link SubtaskServer} instance is created for the node; and a
* collection of {@link SubtaskMaster} instances is started.
*/
public boolean initialize() {
public void initialize() {
log.info("Starting ComputeNodeMaster ({})",
ZiggyConfiguration.getInstance().getString(PropertyName.ZIGGY_VERSION.property()));
log.info("Starting ComputeNodeMaster ({}, {})", BuildInfo.ziggyVersion(),
BuildInfo.pipelineVersion());
ZiggyConfiguration.logJvmProperties();
// It's possible that this node isn't starting until all of the subtasks are
// complete! In that case, it should just exit without doing anything else.
monitor = new TaskMonitor(stateFile, taskDir);
monitor.updateState();
if (monitor.allSubtasksProcessed()) {
StateFile updatedStateFile = new StateFile(stateFile);
updatedStateFile.setState(StateFile.State.COMPLETE);
StateFile.updateStateFile(stateFile, updatedStateFile);
log.info("All subtasks processed, ComputeNodeMaster exiting");
return false;
}
coresPerNode = activeCoresFromFile();
coresPerNode = stateFile.getActiveCoresPerNode();
updateStateFile();
subtaskServer().start();
createTimestamps();
log.info("Starting " + coresPerNode + " subtask masters");
log.info("Starting {} subtask masters", coresPerNode);
startSubtaskMasters();
return true;
}
/**
* Moves the state file for this task from QUEUED to PROCESSING.
*
* @throws IOException if unable to release write lock on state file.
*/
private void updateStateFile() {
// NB: If there are multiple jobs associated with a single task, this update only
// needs to be performed if this job is the first to start
boolean stateFileLockObtained = getWriteLockWithoutBlocking(stateFileLockFile);
try {
StateFile previousStateFile = new StateFile(stateFile);
if (stateFileLockObtained
&& previousStateFile.getState().equals(StateFile.State.QUEUED)) {
stateFile.setState(StateFile.State.PROCESSING);
log.info("Updating state: " + previousStateFile + " -> " + stateFile);
if (!StateFile.updateStateFile(previousStateFile, stateFile)) {
log.error("Failed to update state file: " + previousStateFile);
}
} else {
log.info("State file already moved to PROCESSING state, not modifying state file");
stateFile.setState(StateFile.State.PROCESSING);
}
} finally {
if (stateFileLockObtained) {
releaseWriteLock(stateFileLockFile);
}
}
}
private void createTimestamps() {
long arriveTime = stateFile.getPfeArrivalTimeMillis() != StateFile.INVALID_VALUE
? stateFile.getPfeArrivalTimeMillis()
: System.currentTimeMillis();
long submitTime = stateFile.getPbsSubmitTimeMillis();
long arriveTime = System.currentTimeMillis();
TimestampFile.create(taskDir, TimestampFile.Event.ARRIVE_PFE, arriveTime);
TimestampFile.create(taskDir, TimestampFile.Event.QUEUED_PBS, submitTime);
TimestampFile.create(taskDir, TimestampFile.Event.PBS_JOB_START);
TimestampFile.create(taskDir, TimestampFile.Event.ARRIVE_COMPUTE_NODES, arriveTime);
TimestampFile.create(taskDir, TimestampFile.Event.START);
}
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
private int activeCoresFromFile() {
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(
taskDir.toPath().resolve(AlgorithmExecutor.ACTIVE_CORES_FILE_NAME).toFile()),
ZiggyFileUtils.ZIGGY_CHARSET))) {
String fileText = reader.readLine();
return Integer.parseInt(fileText);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
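
The active-core count is now handed to the compute node through a small single-value file in the task directory (wallTimeFromFile(), below, uses the same pattern for the wall time). Here is a self-contained sketch of that round trip; the file name, charset, and temporary directory are placeholders rather than Ziggy's actual constants.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SingleValueFileSketch {

    public static void main(String[] args) throws IOException {
        Path taskDir = Files.createTempDirectory("task");
        Path activeCoresFile = taskDir.resolve("active-cores"); // placeholder name

        // The executor writes the value before the job starts...
        Files.writeString(activeCoresFile, "4", StandardCharsets.UTF_8);

        // ...and the compute node reads it back, as activeCoresFromFile() does.
        int activeCores = Integer
            .parseInt(Files.readString(activeCoresFile, StandardCharsets.UTF_8).trim());
        System.out.println("activeCores = " + activeCores);
    }
}
```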
/**
* Starts the {@link SubtaskMaster} instances in the threads of a thread pool, one thread per
 * active core on this node. A {@link Semaphore} is used to track the number of
 * active core on this node. A {@link CountDownLatch} is used to track the number of
* {@link SubtaskMaster} instances currently running.
*/
@AcceptableCatchBlock(rationale = Rationale.CAN_NEVER_OCCUR)
private void startSubtaskMasters() {
int timeoutSecs = (int) TimeFormatter
.timeStringHhMmSsToTimeInSeconds(stateFile.getRequestedWallTime());
subtaskMasterSemaphore = new Semaphore(coresPerNode);
int timeoutSecs = wallTimeFromFile();
subtaskMasterCountdownLatch = new CountDownLatch(coresPerNode);
threadPool = subtaskMasterThreadPool();
ThreadFactory threadFactory = new ThreadFactoryBuilder().setNameFormat("SubtaskMaster[%d]")
.build();
String executableName = ZiggyConfiguration.getInstance()
.getString(PropertyName.ZIGGY_ALGORITHM_NAME.property());
for (int i = 0; i < coresPerNode; i++) {
try {
subtaskMasterSemaphore.acquire();
} catch (InterruptedException e) {
// This can never occur. The number of permits is equal to the number of threads,
// thus there is no need to wait for a permit to become available.
throw new AssertionError(e);
}
SubtaskMaster subtaskMaster = new SubtaskMaster(i, nodeName, subtaskMasterSemaphore,
stateFile.getExecutableName(), workingDir, timeoutSecs);
SubtaskMaster subtaskMaster = new SubtaskMaster(i, nodeName,
subtaskMasterCountdownLatch, executableName, workingDir, timeoutSecs);
subtaskMasters.add(subtaskMaster);
threadPool.submit(subtaskMaster, threadFactory);
}
}
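
With the Semaphore replaced by a CountDownLatch, the master simply waits for the latch to reach zero once every SubtaskMaster thread has finished. A minimal sketch of that coordination pattern follows; the worker body, class name, and core count are illustrative placeholders, not Ziggy code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LatchCoordinationSketch {

    public static void main(String[] args) throws InterruptedException {
        int coresPerNode = 4; // assumed value
        CountDownLatch latch = new CountDownLatch(coresPerNode);
        ExecutorService pool = Executors.newFixedThreadPool(coresPerNode);

        for (int i = 0; i < coresPerNode; i++) {
            int worker = i;
            pool.submit(() -> {
                try {
                    System.out.println("worker " + worker + " processing subtasks");
                } finally {
                    latch.countDown(); // each worker counts down when it is done
                }
            });
        }

        latch.await(); // analogous to awaitSubtaskMastersComplete()
        pool.shutdown();
        System.out.println("All workers finished");
    }
}
```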
/**
* Performs periodic checks of subtask processing status. This is accomplished by using a
* {@link ScheduledThreadPoolExecutor} to check processing at the desired intervals. Execution
* of the {@link TaskMonitor} thread will block until the monitoring checks determine that
* processing is completed, at which time the thread resumes execution.
* <p>
* The specific conditions under which the monitor will resume execution of the current thread
* are as follows:
* <ol>
* <li>All of the {@link SubtaskMaster} threads have completed.
* <li>All of the subtasks are either completed or failed.
* </ol>
*/
@AcceptableCatchBlock(rationale = Rationale.SYSTEM_EXIT)
public void monitor() {
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
private int wallTimeFromFile() {
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(
taskDir.toPath().resolve(AlgorithmExecutor.WALL_TIME_FILE_NAME).toFile()),
ZiggyFileUtils.ZIGGY_CHARSET))) {
String fileText = reader.readLine();
return Integer.parseInt(fileText);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
log.info("Waiting for subtasks to complete");
ScheduledThreadPoolExecutor monitoringThreadPool = new ScheduledThreadPoolExecutor(1);
monitoringThreadPool.scheduleAtFixedRate(this, 0L, SLEEP_INTERVAL_MILLIS,
TimeUnit.MILLISECONDS);
public void awaitSubtaskMastersComplete() {
try {
monitoringLatch.await();
subtaskMasterCountdownLatch.await();
} catch (InterruptedException e) {
// If the ComputeNodeMaster main thread is interrupted, it means that the entire
// ComputeNodeMaster is shutting down. We can simply allow it to shut down and don't
// need to do anything further.
Thread.currentThread().interrupt();
}
monitoringThreadPool.shutdownNow();
}
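
The catch block above swallows the InterruptedException but re-asserts the thread's interrupted status so that the shutdown remains observable to callers. A small, self-contained illustration of that idiom, with made-up names:

```java
public class InterruptIdiomSketch {

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(10_000L);
            } catch (InterruptedException e) {
                // Swallow the exception, but preserve the interrupt status.
                Thread.currentThread().interrupt();
            }
            System.out.println("interrupted flag = " + Thread.currentThread().isInterrupted());
        });
        worker.start();
        worker.interrupt();
        worker.join();
    }
}
```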
@Override
public void run() {
if (monitoringLatch.getCount() == 0) {
return;
}
// If the subtask server has failed, there is no finalization to do; just exit
// so that all the subtask master threads stop.
if (!subtaskServer().isListenerRunning()) {
log.error("ComputeNodeMaster: error exit");
endMonitoring();
return;
}
// Do state checks and updates
boolean allSubtasksProcessed = monitor.allSubtasksProcessed();
monitor.updateState();
// If all the subtasks are either completed or failed, exit monitoring
// immediately
if (allSubtasksProcessed) {
log.info("All subtasks complete");
endMonitoring();
return;
}
// If all RemoteSubtaskMasters are done we can exit monitoring
if (allPermitsAvailable()) {
endMonitoring();
}
}
/**
* Ends monitoring. This is accomplished by decrementing the {@link CountDownLatch} that the
* main thread is waiting for.
*/
private synchronized void endMonitoring() {
monitoringLatch.countDown();
}
/**
* If monitoring ended with successful completion of the job, create a timestamp for the
* completion time in the task directory and mark the task's {@link StateFile} as done.
* completion time in the task directory.
*/
public void finish() {
TimestampFile.create(taskDir, TimestampFile.Event.PBS_JOB_FINISH);
if (monitor.allSubtasksProcessed()) {
monitor.markStateFileDone();
}
TimestampFile.create(taskDir, TimestampFile.Event.FINISH);
log.info("ComputeNodeMaster: Done");
}
@ -323,40 +214,10 @@ public class ComputeNodeMaster implements Runnable {
subtaskServer().shutdown();
}
// The following getter methods are intended for testing purposes only. They do not expose any
// of the ComputeNodeMaster's private objects to callers. In some cases it is necessary for
// the methods to be public, as they are used by tests in other packages.
public long getCountDownLatchCount() {
return monitoringLatch.getCount();
}
public int getSemaphorePermits() {
if (subtaskMasterSemaphore == null) {
return -1;
}
return subtaskMasterSemaphore.availablePermits();
}
int subtaskMastersCount() {
return subtaskMasters.size();
}
State getStateFileState() {
return stateFile.getState();
}
int getStateFileNumComplete() {
return stateFile.getNumComplete();
}
int getStateFileNumFailed() {
return stateFile.getNumFailed();
}
int getStateFileNumTotal() {
return stateFile.getNumTotal();
}
/**
* Restores the {@link TaskConfigurationHandler} from disk. Package scope so it can be replaced
* with a mocked instance.
@ -368,31 +229,6 @@ public class ComputeNodeMaster implements Runnable {
return inputsHandler;
}
/**
* Attempts to obtain the lock file for the task's state file, but does not block if it cannot
* obtain it. Broken out as a separate method to support testing.
*
* @return true if lock obtained, false otherwise.
*/
boolean getWriteLockWithoutBlocking(File lockFile) {
return LockManager.getWriteLockWithoutBlocking(lockFile);
}
/**
* Releases the write lock on a file. Broken out as a separate method to support testing.
*/
void releaseWriteLock(File lockFile) {
LockManager.releaseWriteLock(lockFile);
}
/**
* Determines whether all permits are available for the {@link Semaphore} that works with the
* {@link SubtaskMaster} instances. Broken out as a separate method to support testing.
*/
boolean allPermitsAvailable() {
return subtaskMasterSemaphore.availablePermits() == coresPerNode;
}
/**
* Returns a new instance of {@link SubtaskServer}. Broken out as a separate method to support
* testing.
@ -424,10 +260,9 @@ public class ComputeNodeMaster implements Runnable {
ComputeNodeMaster computeNodeMaster = null;
// Startup: constructor and initialization
boolean monitoringRequired = false;
try {
computeNodeMaster = new ComputeNodeMaster(workingDir);
monitoringRequired = computeNodeMaster.initialize();
computeNodeMaster.initialize();
} catch (Exception e) {
// Any exception that occurs in the constructor or initialization is
@ -439,11 +274,8 @@ public class ComputeNodeMaster implements Runnable {
System.exit(1);
}
// Monitoring: wait for subtasks to finish, subtask masters to finish, or exceptions
// to be thrown
if (monitoringRequired) {
computeNodeMaster.monitor();
}
// Wait for the subtask masters to finish.
computeNodeMaster.awaitSubtaskMastersComplete();
// Wrap-up: finalize and clean up
computeNodeMaster.finish();

View File

@ -147,10 +147,10 @@ public class DatastoreDirectoryPipelineInputs implements PipelineInputs {
public SubtaskInformation subtaskInformation(PipelineDefinitionNode pipelineDefinitionNode) {
if (pipelineDefinitionNode.getSingleSubtask()) {
return new SubtaskInformation(getPipelineTask().getModuleName(),
getPipelineTask().uowTaskInstance().briefState(), 1);
getPipelineTask().getUnitOfWork().briefState(), 1);
}
return new SubtaskInformation(getPipelineTask().getModuleName(),
getPipelineTask().uowTaskInstance().briefState(),
getPipelineTask().getUnitOfWork().briefState(),
datastoreFileManager().subtaskCount(pipelineDefinitionNode));
}

View File

@ -4,7 +4,7 @@ import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections4.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

View File

@ -31,13 +31,13 @@ import gov.nasa.ziggy.data.management.DatastoreProducerConsumerOperations;
import gov.nasa.ziggy.metrics.IntervalMetric;
import gov.nasa.ziggy.metrics.Metric;
import gov.nasa.ziggy.metrics.ValueMetric;
import gov.nasa.ziggy.module.remote.TimestampFile;
import gov.nasa.ziggy.pipeline.definition.PipelineModule;
import gov.nasa.ziggy.pipeline.definition.PipelineModuleDefinition;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics.Units;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric.Units;
import gov.nasa.ziggy.pipeline.definition.ProcessingStep;
import gov.nasa.ziggy.services.alert.Alert.Severity;
import gov.nasa.ziggy.services.alert.AlertService;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.services.config.PropertyName;
@ -90,7 +90,7 @@ public class ExternalProcessPipelineModule extends PipelineModule {
@Override
public List<RunMode> restartModes() {
return List.of(RunMode.RESTART_FROM_BEGINNING, RunMode.RESUBMIT,
RunMode.RESUME_CURRENT_STEP, RunMode.RESUME_MONITORING);
RunMode.RESUME_CURRENT_STEP);
}
protected File getTaskDir() {
@ -138,11 +138,11 @@ public class ExternalProcessPipelineModule extends PipelineModule {
// Set the next step, whatever it might be.
incrementProcessingStep();
// If there are sub-task inputs, then we can go on to the next step.
// If there are subtask inputs, then we can go on to the next step.
successful = true;
} else {
// If there are no sub-task inputs, we should stop processing.
// If there are no subtask inputs, we should stop processing.
successful = false;
checkHaltRequest(ProcessingStep.MARSHALING);
}
@ -158,7 +158,7 @@ public class ExternalProcessPipelineModule extends PipelineModule {
File taskWorkingDirectory) {
pipelineInputs().copyDatastoreFilesToTaskDirectory(taskConfiguration,
taskWorkingDirectory.toPath());
pipelineTaskOperations().updateSubtaskCounts(pipelineTask().getId(),
pipelineTaskDataOperations().updateSubtaskCounts(pipelineTask,
taskConfiguration.getSubtaskCount(), 0, 0);
}
@ -211,7 +211,7 @@ public class ExternalProcessPipelineModule extends PipelineModule {
checkHaltRequest(ProcessingStep.EXECUTING);
doneLooping = true;
processingSuccessful = false;
log.info("Resubmitting {} algorithm to remote system", ProcessingStep.EXECUTING);
log.info("Resubmitting {} algorithm to remote system...done", ProcessingStep.EXECUTING);
}
/**
@ -240,19 +240,19 @@ public class ExternalProcessPipelineModule extends PipelineModule {
// add metrics for "RemoteWorker", "PleiadesQueue", "Matlab",
// "PendingReceive"
long remoteWorkerTime = timestampFileElapsedTimeMillis(TimestampFile.Event.ARRIVE_PFE,
TimestampFile.Event.QUEUED_PBS);
long pleiadesQueueTime = timestampFileElapsedTimeMillis(TimestampFile.Event.QUEUED_PBS,
TimestampFile.Event.PBS_JOB_START);
long pleiadesWallTime = timestampFileElapsedTimeMillis(
TimestampFile.Event.PBS_JOB_START, TimestampFile.Event.PBS_JOB_FINISH);
long remoteWorkerTime = timestampFileElapsedTimeMillis(
TimestampFile.Event.ARRIVE_COMPUTE_NODES, TimestampFile.Event.QUEUED);
long pleiadesQueueTime = timestampFileElapsedTimeMillis(TimestampFile.Event.QUEUED,
TimestampFile.Event.START);
long pleiadesWallTime = timestampFileElapsedTimeMillis(TimestampFile.Event.START,
TimestampFile.Event.FINISH);
long pendingReceiveTime = startTransferTime
- timestampFileTimestamp(TimestampFile.Event.PBS_JOB_FINISH);
- timestampFileTimestamp(TimestampFile.Event.FINISH);
log.info("remoteWorkerTime = " + remoteWorkerTime);
log.info("pleiadesQueueTime = " + pleiadesQueueTime);
log.info("pleiadesWallTime = " + pleiadesWallTime);
log.info("pendingReceiveTime = " + pendingReceiveTime);
log.info("remoteWorkerTime = {}", remoteWorkerTime);
log.info("pleiadesQueueTime = {}", pleiadesQueueTime);
log.info("pleiadesWallTime = {}", pleiadesWallTime);
log.info("pendingReceiveTime = {}", pendingReceiveTime);
valueMetricAddValue(REMOTE_WORKER_WAIT_METRIC, remoteWorkerTime);
valueMetricAddValue(PLEIADES_QUEUE_METRIC, pleiadesQueueTime);
@ -262,9 +262,9 @@ public class ExternalProcessPipelineModule extends PipelineModule {
ProcessingFailureSummary failureSummary = processingFailureSummary();
boolean abandonPersisting = false;
if (!failureSummary.isAllTasksSucceeded() && !failureSummary.isAllTasksFailed()) {
log.info("Sub-task failures occurred. List of sub-task failures follows:");
for (String failedSubTask : failureSummary.getFailedSubTaskDirs()) {
log.info(" " + failedSubTask);
log.info("Subtask failures occurred. List of subtask failures follows:");
for (String failedSubtask : failureSummary.getFailedSubtaskDirs()) {
log.info(" {}", failedSubtask);
}
ImmutableConfiguration config = ZiggyConfiguration.getInstance();
boolean allowPartialTasks = config
@ -272,11 +272,11 @@ public class ExternalProcessPipelineModule extends PipelineModule {
abandonPersisting = !allowPartialTasks;
}
if (failureSummary.isAllTasksFailed()) {
log.info("All sub-tasks failed in processing, abandoning storage of results");
log.info("All subtasks failed in processing, abandoning storage of results");
abandonPersisting = true;
}
if (abandonPersisting) {
throw new PipelineException("Unable to persist due to sub-task failures");
throw new PipelineException("Unable to persist due to subtask failures");
}
IntervalMetric.measure(STORE_OUTPUTS_METRIC, () -> {
@ -321,7 +321,7 @@ public class ExternalProcessPipelineModule extends PipelineModule {
log.warn("Input file {} produced no output", inputFile.toString());
}
AlertService.getInstance()
.generateAndBroadcastAlert("Algorithm", taskId(), AlertService.Severity.WARNING,
.generateAndBroadcastAlert("Algorithm", pipelineTask(), Severity.WARNING,
inputFiles.getFilesWithoutOutputs()
+ " input files produced no output, see log for details");
}
@ -370,7 +370,7 @@ public class ExternalProcessPipelineModule extends PipelineModule {
@Override
protected void restartFromBeginning() {
pipelineTaskOperations().updateProcessingStep(taskId(), processingSteps().get(0));
pipelineTaskDataOperations().updateProcessingStep(pipelineTask, processingSteps().get(0));
processingMainLoop();
}
@ -381,15 +381,10 @@ public class ExternalProcessPipelineModule extends PipelineModule {
@Override
protected void resubmit() {
pipelineTaskOperations().updateProcessingStep(taskId(), ProcessingStep.SUBMITTING);
pipelineTaskDataOperations().updateProcessingStep(pipelineTask, ProcessingStep.SUBMITTING);
processingMainLoop();
}
@Override
protected void resumeMonitoring() {
algorithmManager().getExecutor().resumeMonitoring();
}
@Override
protected void runStandard() {
processingMainLoop();
@ -403,19 +398,19 @@ public class ExternalProcessPipelineModule extends PipelineModule {
throw new PipelineException("processTask called with incorrect pipeline task");
}
List<PipelineTaskMetrics> summaryMetrics = pipelineTaskOperations()
.summaryMetrics(pipelineTask);
List<PipelineTaskMetric> pipelineTaskMetrics = pipelineTaskDataOperations()
.pipelineTaskMetrics(pipelineTask);
log.debug("Thread Metrics:");
for (String threadMetricName : threadMetrics.keySet()) {
log.debug("TM: " + threadMetricName + ": "
+ threadMetrics.get(threadMetricName).getLogString());
log.debug("TM: {}: {}", threadMetricName,
threadMetrics.get(threadMetricName).getLogString());
}
// cross-reference existing summary metrics by category
Map<String, PipelineTaskMetrics> summaryMetricsByCategory = new HashMap<>();
for (PipelineTaskMetrics summaryMetric : summaryMetrics) {
summaryMetricsByCategory.put(summaryMetric.getCategory(), summaryMetric);
Map<String, PipelineTaskMetric> pipelineTaskMetricByCategory = new HashMap<>();
for (PipelineTaskMetric pipelineTaskMetric : pipelineTaskMetrics) {
pipelineTaskMetricByCategory.put(pipelineTaskMetric.getCategory(), pipelineTaskMetric);
}
String[] categories;
@ -444,16 +439,15 @@ public class ExternalProcessPipelineModule extends PipelineModule {
ValueMetric iMetric = (ValueMetric) metric;
totalTime = iMetric.getSum();
} else {
log.info("Module did not provide metric with name = " + metricName);
log.info("Module did not provide metric with name = {}", metricName);
}
log.info("TaskID={}, category={}, time(ms)={}", pipelineTask.getId(), category,
totalTime);
log.info("TaskID={}, category={}, time(ms)={}", pipelineTask, category, totalTime);
PipelineTaskMetrics m = summaryMetricsByCategory.get(category);
PipelineTaskMetric m = pipelineTaskMetricByCategory.get(category);
if (m == null) {
m = new PipelineTaskMetrics(category, totalTime, unit);
summaryMetrics.add(m);
m = new PipelineTaskMetric(category, totalTime, unit);
pipelineTaskMetrics.add(m);
}
// don't overwrite the existing value if no value was recorded for
@ -463,8 +457,8 @@ public class ExternalProcessPipelineModule extends PipelineModule {
m.setValue(totalTime);
}
}
pipelineTask.setSummaryMetrics(summaryMetrics);
pipelineTaskOperations().merge(pipelineTask);
pipelineTaskDataOperations().updatePipelineTaskMetrics(pipelineTask, pipelineTaskMetrics);
}
/**
@ -491,7 +485,7 @@ public class ExternalProcessPipelineModule extends PipelineModule {
}
long timestampFileTimestamp(TimestampFile.Event event) {
return TimestampFile.timestamp(getTaskDir(), event);
return TimestampFile.timestamp(getTaskDir(), event, false);
}
ValueMetric valueMetricAddValue(String name, long value) {

View File

@ -1,83 +0,0 @@
package gov.nasa.ziggy.module;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import gov.nasa.ziggy.module.AlgorithmExecutor.AlgorithmType;
import gov.nasa.ziggy.module.remote.QstatMonitor;
import gov.nasa.ziggy.module.remote.QueueCommandManager;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.util.HostNameUtils;
/**
* Interface for classes that monitor remote jobs. This allows a dummy implementation to be supplied
* in cases in which there are calls to a remote job monitor but no remote jobs to be monitored (see
* {@link AlgorithmMonitor} for more information).
*
* @author PT
*/
public interface JobMonitor {
/**
* Returns a new instance of {@link JobMonitor} that is correct for its use-case based on
* arguments. In particular, for remote tasks an instance of {@link QstatMonitor} will be
* returned, while for local tasks a dummy instance, with no actual functionality, will be
* returned.
*/
static JobMonitor newInstance(String username, AlgorithmType algorithmType) {
if (algorithmType.equals(AlgorithmType.REMOTE)) {
return new QstatMonitor(username, HostNameUtils.shortHostName());
}
return new JobMonitor() {
};
}
default void addToMonitoring(StateFile stateFile) {
}
default void addToMonitoring(PipelineTask pipelineTask) {
}
default void endMonitoring(StateFile stateFile) {
}
default void update() {
}
default Set<Long> allIncompleteJobIds(PipelineTask pipelineTask) {
return Collections.emptySet();
}
default Set<Long> allIncompleteJobIds(StateFile stateFile) {
return Collections.emptySet();
}
default boolean isFinished(StateFile stateFile) {
return false;
}
default Map<Long, Integer> exitStatus(StateFile stateFile) {
return Collections.emptyMap();
}
default Map<Long, String> exitComment(StateFile stateFile) {
return Collections.emptyMap();
}
default String getOwner() {
return "";
}
default String getServerName() {
return "";
}
default QueueCommandManager getQstatCommandManager() {
return null;
}
default Set<String> getJobsInMonitor() {
return Collections.emptySet();
}
}
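
The dummy instance returned for local tasks is simply an anonymous class that inherits the harmless default methods, so callers never have to branch on the algorithm type. The following standalone sketch shows that "default-method null object" pattern with invented names; it is not the Ziggy API.

```java
import java.util.Collections;
import java.util.Set;

public class MonitorSketch {

    interface Monitor {
        static Monitor newInstance(boolean remote) {
            // Real behavior for remote jobs, a do-nothing instance otherwise.
            return remote ? new RemoteMonitor() : new Monitor() {
            };
        }

        default void update() {
        }

        default Set<Long> incompleteJobIds() {
            return Collections.emptySet();
        }
    }

    static class RemoteMonitor implements Monitor {
        @Override
        public void update() {
            System.out.println("Polling the batch scheduler...");
        }
    }

    public static void main(String[] args) {
        Monitor monitor = Monitor.newInstance(false);
        monitor.update(); // no-op for local execution, safe to call
        System.out.println(monitor.incompleteJobIds());
    }
}
```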

View File

@ -12,6 +12,7 @@ import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.module.remote.PbsParameters;
import gov.nasa.ziggy.pipeline.definition.PipelineDefinitionNodeExecutionResources;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.database.PipelineModuleDefinitionOperations;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.services.logging.TaskLog;
import gov.nasa.ziggy.services.messages.MonitorAlgorithmRequest;
@ -30,6 +31,8 @@ public class LocalAlgorithmExecutor extends AlgorithmExecutor {
private static final Logger log = LoggerFactory.getLogger(LocalAlgorithmExecutor.class);
private PipelineModuleDefinitionOperations pipelineModuleDefinitionOperations = new PipelineModuleDefinitionOperations();
public LocalAlgorithmExecutor(PipelineTask pipelineTask) {
super(pipelineTask);
}
@ -51,20 +54,18 @@ public class LocalAlgorithmExecutor extends AlgorithmExecutor {
}
@Override
protected void addToMonitor(StateFile stateFile) {
ZiggyMessenger.publish(new MonitorAlgorithmRequest(stateFile, algorithmType()));
protected void addToMonitor() {
ZiggyMessenger.publish(new MonitorAlgorithmRequest(pipelineTask, workingDir()));
}
@Override
protected void submitForExecution(StateFile stateFile) {
protected void submitForExecution() {
stateFile.setPbsSubmitTimeMillis(System.currentTimeMillis());
stateFile.persist();
addToMonitor(stateFile);
addToMonitor();
CommandLine cmdLine = algorithmCommandLine();
pipelineTaskOperations().setLocalExecution(pipelineTask.getId());
pipelineTaskDataOperations().updateAlgorithmType(pipelineTask, AlgorithmType.LOCAL);
// Start the external process -- note that it will cause execution to block until
// the algorithm has completed or failed.
@ -79,11 +80,9 @@ public class LocalAlgorithmExecutor extends AlgorithmExecutor {
// as having failed
if (exitCode != 0) {
if (!ZiggyShutdownHook.shutdownInProgress()) {
throw new PipelineException(
"Local processing of task " + pipelineTask.getId() + " failed");
throw new PipelineException("Local processing of task " + pipelineTask + " failed");
}
log.error(
"Task " + pipelineTask.getId() + " processing incomplete due to worker shutdown");
log.error("Task {} processing incomplete due to worker shutdown", pipelineTask);
}
}
@ -114,4 +113,21 @@ public class LocalAlgorithmExecutor extends AlgorithmExecutor {
public AlgorithmType algorithmType() {
return AlgorithmType.LOCAL;
}
@Override
protected String activeCores() {
return "1";
}
@Override
protected String wallTime() {
return Integer.toString(pipelineModuleDefinitionOperations()
.pipelineModuleExecutionResources(pipelineModuleDefinitionOperations()
.pipelineModuleDefinition(pipelineTask.getModuleName()))
.getExeTimeoutSeconds());
}
PipelineModuleDefinitionOperations pipelineModuleDefinitionOperations() {
return pipelineModuleDefinitionOperations;
}
}

View File

@ -1,6 +1,6 @@
package gov.nasa.ziggy.module;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetrics.Units;
import gov.nasa.ziggy.pipeline.definition.PipelineTaskMetric.Units;
/**
* Categories used by the pipeline in various contexts.

View File

@ -30,14 +30,14 @@ public abstract class PipelineInputsOutputsUtils implements Persistable {
private static final String SERIALIZED_OUTPUTS_TYPE_FILE = ".output-types.ser";
/**
* Returns the task directory. Assumes that the working directory is the sub-task directory.
* Returns the task directory. Assumes that the working directory is the subtask directory.
*/
public static Path taskDir() {
return DirectoryProperties.workingDir().getParent();
}
/**
* Returns the module executable name. Assumes that the working directory is the sub-task
* Returns the module executable name. Assumes that the working directory is the subtask
* directory.
*/
public static String moduleName() {
@ -45,15 +45,11 @@ public abstract class PipelineInputsOutputsUtils implements Persistable {
}
public static String moduleName(Path taskDir) {
String taskDirString = taskDir.getFileName().toString();
PipelineTask.TaskBaseNameMatcher m = new PipelineTask.TaskBaseNameMatcher(taskDirString);
return m.moduleName();
return new PipelineTask.TaskBaseNameMatcher(taskDir.getFileName().toString()).moduleName();
}
public static long taskId(Path taskDir) {
String taskDirString = taskDir.getFileName().toString();
PipelineTask.TaskBaseNameMatcher m = new PipelineTask.TaskBaseNameMatcher(taskDirString);
return m.taskId();
return new PipelineTask.TaskBaseNameMatcher(taskDir.getFileName().toString()).taskId();
}
/**

View File

@ -7,41 +7,41 @@ import java.util.List;
import gov.nasa.ziggy.module.io.ModuleInterfaceUtils;
/**
* Provides summary information on failed sub-tasks.
* Provides summary information on failed subtasks.
*
* @author PT
*/
public class ProcessingFailureSummary {
private List<String> failedSubTaskDirs = new ArrayList<>();
private List<String> failedSubtaskDirs = new ArrayList<>();
private boolean allTasksFailed = false;
public ProcessingFailureSummary(String moduleName, File taskDirectory) {
SubtaskDirectoryIterator taskDirectoryIterator = new SubtaskDirectoryIterator(
taskDirectory);
int numSubTasks = taskDirectoryIterator.numSubTasks();
int numSubtasks = taskDirectoryIterator.numSubtasks();
// loop over sub-task directories and look for stack trace files
// loop over subtask directories and look for stack trace files
while (taskDirectoryIterator.hasNext()) {
File subTaskDir = taskDirectoryIterator.next().getSubtaskDir();
if (ModuleInterfaceUtils.errorFile(subTaskDir, moduleName).exists()) {
failedSubTaskDirs.add(subTaskDir.getName());
File subtaskDir = taskDirectoryIterator.next().getSubtaskDir();
if (ModuleInterfaceUtils.errorFile(subtaskDir, moduleName).exists()) {
failedSubtaskDirs.add(subtaskDir.getName());
}
}
allTasksFailed = failedSubTaskDirs.size() == numSubTasks;
allTasksFailed = failedSubtaskDirs.size() == numSubtasks;
}
public boolean isAllTasksFailed() {
return allTasksFailed;
}
public List<String> getFailedSubTaskDirs() {
return failedSubTaskDirs;
public List<String> getFailedSubtaskDirs() {
return failedSubtaskDirs;
}
public boolean isAllTasksSucceeded() {
return failedSubTaskDirs.size() == 0;
return failedSubtaskDirs.size() == 0;
}
}
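
A hypothetical caller sketch follows; the module name "pa" and the task-directory path are placeholders, and the example assumes ProcessingFailureSummary is available on the classpath. The summary is typically consulted after execution to decide whether results should be persisted.

```java
import java.io.File;

public class FailureSummarySketch {

    public static void main(String[] args) {
        ProcessingFailureSummary summary =
            new ProcessingFailureSummary("pa", new File("/data/task-data/1-2-pa"));
        if (summary.isAllTasksFailed()) {
            System.out.println("All subtasks failed; abandoning persistence");
        } else if (!summary.isAllTasksSucceeded()) {
            summary.getFailedSubtaskDirs()
                .forEach(dir -> System.out.println("Failed subtask " + dir));
        }
    }
}
```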

View File

@ -1,888 +0,0 @@
package gov.nasa.ziggy.module;
import java.io.File;
import java.io.FileFilter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.configuration2.PropertiesConfiguration;
import org.apache.commons.configuration2.builder.fluent.Configurations;
import org.apache.commons.configuration2.ex.ConfigurationException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.WildcardFileFilter;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.module.remote.PbsParameters;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.services.config.DirectoryProperties;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
import gov.nasa.ziggy.util.Iso8601Formatter;
import gov.nasa.ziggy.util.io.LockManager;
import gov.nasa.ziggy.util.io.ZiggyFileUtils;
/**
* This class models a file whose name contains the state of a pipeline task executing on a remote
* cluster.
* <p>
* The file also contains additional properties of the remote job.
* <p>
* Each state file represents a single unit of work from the perspective of the pipeline module.
*
* <pre>
* Filename Format:
*
* ziggy-PIID.PTID.EXENAME.STATE_TOTAL-COMPLETE-FAILED
*
* PIID: Pipeline Instance ID
* PTID: Pipeline Task ID
* EXENAME: Name of the MATLAB executable
* STATE: enum(SUBMITTED,PROCESSING,ERRORS_RUNNING,FAILED,COMPLETE)
 * TOTAL-COMPLETE-FAILED: Number of subtasks in each category
* </pre>
*
* The file contains the properties that reflect the properties in this class, including:
* <dl>
* <dt>timeoutSecs</dt>
* <dd>Timeout for the MATLAB process.</dd>
* <dt>gigsPerCore</dt>
* <dd>Required memory per core used. Used to calculate coresPerNode based on architecture.</dd>
* <dt>tasksPerCore</dt>
* <dd>Number of tasks to allocate to each available core.</dd>
* <dt>remoteNodeArchitecture</dt>
* <dd>Architecture to use.</dd>
* <dt>remoteGroup</dt>
* <dd>Group name used for the qsub command on the remote node.</dd>
* <dt>queueName</dt>
* <dd>Queue name used for the qsub command on the remote node.</dd>
* <dt>reRunnable</dt>
* <dd>Whether this task is re-runnable.</dd>
* <dt>localBinToMatEnabled</dt>
* <dd>If true, don't generate .mat files on the remote node.</dd>
* <dt>requestedWallTime</dt>
* <dd>Requested wall time for the PBS qsub command.</dd>
* <dt>symlinksEnabled</dt>
* <dd>Determines whether symlinks are created between sub-task directories and files in the
* top-level task directory. This should be enabled for modules that store files common to all
* sub-tasks in the top-level task directory.</dd>
* <dt>pbsSubmitTimeMillis</dt>
* <dd>The time in milliseconds that the job was submitted to the Pleiades scheduler, the Portable
* Batch System (PBS).</dd>
* <dt>pfeArrivalTimeMillis</dt>
* <dd>This seems to be the same as pbsSubmitTimeMillis in most cases.</dd>
* </dl>
*
* @author Bill Wohler
* @author Todd Klaus
*/
public class StateFile implements Comparable<StateFile>, Serializable {
private static final long serialVersionUID = 20230511L;
private static final Logger log = LoggerFactory.getLogger(StateFile.class);
public static final Pattern TASK_DIR_PATTERN = Pattern.compile("([0-9]+)-([0-9]+)-(\\S+)");
public static final int TASK_DIR_INSTANCE_ID_GROUP_NUMBER = 1;
public static final int TASK_DIR_TASK_ID_GROUP_NUMBER = 2;
public static final int TASK_DIR_MODULE_NAME_GROUP_NUMBER = 3;
public static final String PREFIX_BARE = "ziggy";
public static final String PREFIX = PREFIX_BARE + ".";
public static final String PREFIX_WITH_BACKSLASHES = PREFIX_BARE + "\\.";
public static final String DEFAULT_REMOTE_NODE_ARCHITECTURE = "none";
public static final String DEFAULT_WALL_TIME = "24:00:00";
public static final String INVALID_STRING = "none";
public static final int INVALID_VALUE = -1;
public static final String LOCK_FILE_NAME = ".state-file.lock";
private static final String REMOTE_NODE_ARCHITECTURE_PROP_NAME = "remoteNodeArchitecture";
private static final String MIN_CORES_PER_NODE_PROP_NAME = "minCoresPerNode";
private static final String MIN_GIGS_PER_NODE_PROP_NAME = "minGigsPerNode";
private static final String REMOTE_GROUP_PROP_NAME = "remoteGroup";
private static final String QUEUE_NAME_PROP_NAME = "queueName";
private static final String REQUESTED_WALLTIME_PROP_NAME = "requestedWallTime";
private static final String REQUESTED_NODE_COUNT_PROP_NAME = "requestedNodeCount";
private static final String ACTIVE_CORES_PER_NODE_PROP_NAME = "activeCoresPerNode";
private static final String GIGS_PER_SUBTASK_PROP_NAME = "gigsPerSubtask";
private static final String EXECUTABLE_NAME_PROP_NAME = "executableName";
private static final String PBS_SUBMIT_PROP_NAME = "pbsSubmitTimeMillis";
private static final String PFE_ARRIVAL_PROP_NAME = "pfeArrivalTimeMillis";
public enum State {
/** Task has been initialized. */
INITIALIZED,
/**
* Task has been submitted by the worker, but not yet picked up by the remote cluster.
*/
SUBMITTED,
/** Task is waiting for compute nodes to become available. */
QUEUED,
/** Task is running on the compute nodes. */
PROCESSING,
/**
* Task has finished running on the compute nodes, either with or without subtask errors.
*/
COMPLETE,
/** The final state for this task has been acknowledged by the worker. */
CLOSED;
}
private static String statesPatternElement;
// Concatenate all of the states into a single String for use in the state file
// name pattern.
static {
StringBuilder sb = new StringBuilder();
for (State state : State.values()) {
sb.append(state.toString());
sb.append("|");
}
sb.setLength(sb.length() - 1);
statesPatternElement = sb.toString();
}
// Pattern and regex for a state file name
private static final String STATE_FILE_NAME_REGEX = PREFIX_WITH_BACKSLASHES
+ "([0-9]+)\\.([0-9]+)\\.(\\S+)\\." + "(" + statesPatternElement + ")"
+ "_([0-9]+)-([0-9]+)-([0-9]+)";
public static final Pattern STATE_FILE_NAME_PATTERN = Pattern.compile(STATE_FILE_NAME_REGEX);
private static final int STATE_FILE_NAME_INSTANCE_ID_GROUP_NUMBER = 1;
private static final int STATE_FILE_NAME_TASK_ID_GROUP_NUMBER = 2;
private static final int STATE_FILE_NAME_MODULE_GROUP_NUMBER = 3;
private static final int STATE_FILE_NAME_STATE_GROUP_NUMBER = 4;
private static final int STATE_FILE_NAME_TOTAL_SUBTASKS_GROUP_NUMBER = 5;
private static final int STATE_FILE_NAME_COMPLETE_SUBTASKS_GROUP_NUMBER = 6;
private static final int STATE_FILE_NAME_FAILED_SUBTASKS_GROUP_NUMBER = 7;
// This is a slightly more informative explanation of the file name format,
// used in log files so the user knows exactly what was expected.
private static final String FORMAT = PREFIX + "PIID.PTID.MODNAME.STATE_TOTAL-COMPLETE-FAILED";
// Fields in the file name.
private long pipelineInstanceId = 0;
private long pipelineTaskId = 0;
private String moduleName;
private State state = State.INITIALIZED;
private int numTotal = 0;
private int numComplete = 0;
private int numFailed = 0;
/** Contains all properties from the file. */
private transient PropertiesConfiguration props = new PropertiesConfiguration();
public StateFile() {
}
/**
* Constructs a StateFile containing only the invariant part.
*
* @param moduleName the name of the pipeline module for the task
* @param pipelineInstanceId the pipeline instance ID
* @param pipelineTaskId the pipeline task ID
*/
public StateFile(String moduleName, long pipelineInstanceId, long pipelineTaskId) {
this.moduleName = moduleName;
this.pipelineInstanceId = pipelineInstanceId;
this.pipelineTaskId = pipelineTaskId;
}
/**
* Constructs a {@link StateFile} instance from a task directory. The task directory name is
* parsed to obtain the module name, instance ID, and task ID components.
*/
public static StateFile of(Path taskDir) {
String taskDirName = taskDir.getFileName().toString();
Matcher matcher = TASK_DIR_PATTERN.matcher(taskDirName);
if (!matcher.matches()) {
throw new IllegalArgumentException(
"Task dir name " + taskDirName + " does not match convention for task dir names");
}
return new StateFile(matcher.group(TASK_DIR_MODULE_NAME_GROUP_NUMBER),
Long.parseLong(matcher.group(TASK_DIR_INSTANCE_ID_GROUP_NUMBER)),
Long.parseLong(matcher.group(TASK_DIR_TASK_ID_GROUP_NUMBER)));
}
/**
* Constructs a StateFile from an existing name.
*/
public StateFile(String name) {
parse(name);
}
/**
 * Parses a string of the form PREFIX + PIID.PTID.MODNAME.STATE_TOTAL-COMPLETE-FAILED.
*/
private void parse(String name) {
Matcher matcher = STATE_FILE_NAME_PATTERN.matcher(name);
if (!matcher.matches()) {
throw new IllegalArgumentException(name + " does not match expected format: " + FORMAT);
}
pipelineInstanceId = Long
.parseLong(matcher.group(STATE_FILE_NAME_INSTANCE_ID_GROUP_NUMBER));
pipelineTaskId = Long.parseLong(matcher.group(STATE_FILE_NAME_TASK_ID_GROUP_NUMBER));
moduleName = matcher.group(STATE_FILE_NAME_MODULE_GROUP_NUMBER);
state = State.valueOf(matcher.group(STATE_FILE_NAME_STATE_GROUP_NUMBER));
numTotal = Integer.parseInt(matcher.group(STATE_FILE_NAME_TOTAL_SUBTASKS_GROUP_NUMBER));
numComplete = Integer
.parseInt(matcher.group(STATE_FILE_NAME_COMPLETE_SUBTASKS_GROUP_NUMBER));
numFailed = Integer.parseInt(matcher.group(STATE_FILE_NAME_FAILED_SUBTASKS_GROUP_NUMBER));
}
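
For concreteness, here is a hypothetical name in the format that parse() expects, together with the fields it would yield; the instance, task, and module values are made up.

```java
// ziggy.PIID.PTID.MODNAME.STATE_TOTAL-COMPLETE-FAILED
StateFile stateFile = new StateFile("ziggy.5.42.pa.PROCESSING_10-6-1");
// stateFile.getPipelineInstanceId() -> 5
// stateFile.getPipelineTaskId()     -> 42
// stateFile.getModuleName()         -> "pa"
// stateFile.getState()              -> State.PROCESSING
// numTotal 10, numComplete 6, numFailed 1
```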
/**
* Creates a shallow copy of the given object.
*/
public StateFile(StateFile other) {
// members
moduleName = other.getModuleName();
pipelineInstanceId = other.getPipelineInstanceId();
pipelineTaskId = other.getPipelineTaskId();
state = other.getState();
numTotal = other.getNumTotal();
numComplete = other.getNumComplete();
numFailed = other.getNumFailed();
// properties of the PropertiesConfiguration
List<String> propertyNames = other.getSortedPropertyNames();
for (String propertyName : propertyNames) {
props.setProperty(propertyName, other.props.getProperty(propertyName));
}
}
/**
 * Gets the property names of the StateFile's PropertiesConfiguration object, sorted
 * alphabetically.
*
* @return a List<String> of property names defined for this object.
*/
private List<String> getSortedPropertyNames() {
Iterator<String> propertyNamesIterator = props.getKeys();
List<String> propertyNames = new ArrayList<>();
while (propertyNamesIterator.hasNext()) {
propertyNames.add(propertyNamesIterator.next());
}
Collections.sort(propertyNames);
return propertyNames;
}
/**
* Creates a StateFile from the given parameters.
*/
public static StateFile generateStateFile(PipelineTask pipelineTask,
PbsParameters pbsParameters, int numSubtasks) {
StateFile state = new StateFile.Builder().moduleName(pipelineTask.getModuleName())
.executableName(pipelineTask.getExecutableName())
.pipelineInstanceId(pipelineTask.getPipelineInstanceId())
.pipelineTaskId(pipelineTask.getId())
.numTotal(numSubtasks)
.numComplete(0)
.numFailed(0)
.state(StateFile.State.QUEUED)
.build();
if (pbsParameters != null) {
state.setActiveCoresPerNode(pbsParameters.getActiveCoresPerNode());
state.setRemoteNodeArchitecture(pbsParameters.getArchitecture().getNodeName());
state.setMinCoresPerNode(pbsParameters.getMinCoresPerNode());
state.setMinGigsPerNode(pbsParameters.getMinGigsPerNode());
state.setRemoteGroup(pbsParameters.getRemoteGroup());
state.setQueueName(pbsParameters.getQueueName());
state.setRequestedWallTime(pbsParameters.getRequestedWallTime());
state.setRequestedNodeCount(pbsParameters.getRequestedNodeCount());
state.setGigsPerSubtask(pbsParameters.getGigsPerSubtask());
} else {
state.setActiveCoresPerNode(1);
state.setRemoteNodeArchitecture("");
state.setRemoteGroup("");
state.setQueueName("");
}
return state;
}
/**
* Persists this {@link StateFile} to the state file directory.
*/
@AcceptableCatchBlock(rationale = Rationale.CAN_NEVER_OCCUR)
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public File persist() {
File directory = DirectoryProperties.stateFilesDir().toFile();
File file = new File(directory, name());
try (Writer fw = new OutputStreamWriter(new FileOutputStream(file),
ZiggyFileUtils.ZIGGY_CHARSET)) {
props.write(fw);
// Also, move any old state files that are for the same instance and
// task as this one.
moveOldStateFiles(directory);
} catch (ConfigurationException e) {
// This can never occur. The construction of the props field is guaranteed
// to be correct.
throw new AssertionError(e);
} catch (IOException e) {
throw new UncheckedIOException("Unable to write to file " + file.toString(), e);
}
return file;
}
private void moveOldStateFiles(File stateFileDir) {
String stateFileName = name();
FileFilter fileFilter = new WildcardFileFilter(invariantPart() + "*");
File[] matches = stateFileDir.listFiles(fileFilter);
if (matches == null || matches.length == 0) {
throw new PipelineException(
"State file \"" + stateFileName + "\" does not exist or there was an I/O error.");
}
String iso8601Date = Iso8601Formatter.dateTimeLocalFormatter().format(new Date());
int fileCounter = 0;
// For all the matched files that are NOT the current state file, rename
// the old ones to a new name that removes the "ziggy" at the beginning
// (replacing it with "old"), and appends a datestamp and index #.
for (File match : matches) {
if (!match.getName().equals(stateFileName)) {
String nameSansPrefix = match.getName().substring(PREFIX.length());
String newName = "old." + nameSansPrefix + "." + iso8601Date + "."
+ Integer.toString(fileCounter);
File newFile = new File(stateFileDir, newName);
match.renameTo(newFile);
log.warn("File " + match.getName() + " in directory " + stateFileDir
+ " renamed to " + newFile.getName());
}
}
}
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public static boolean updateStateFile(StateFile oldStateFile, StateFile newStateFile) {
File stateFileDir = DirectoryProperties.stateFilesDir().toFile();
if (oldStateFile.equals(newStateFile)) {
log.debug("Old state file " + oldStateFile.name() + " is the same as new state file "
+ newStateFile.name() + ", not changing");
}
// Update the state file.
log.info("Updating state: " + oldStateFile + " -> " + newStateFile);
File oldFile = new File(stateFileDir, oldStateFile.name());
File newFile = new File(stateFileDir, newStateFile.name());
log.debug(" renaming file: " + oldFile + " -> " + newFile);
try {
FileUtils.moveFile(oldFile, newFile);
} catch (IOException e) {
throw new UncheckedIOException("Unable to move file " + oldFile.toString(), e);
}
return true;
}
/**
* Returns a {@link List} of {@link StateFile} instances from the state files directory that are
* in the {@link State#PROCESSING} state.
*/
public static List<StateFile> processingStateFilesFromDisk() {
File directory = DirectoryProperties.stateFilesDir().toFile();
String[] filenames = directory.list((dir, name) -> {
Matcher matcher = STATE_FILE_NAME_PATTERN.matcher(name);
return matcher.matches() && matcher.group(STATE_FILE_NAME_STATE_GROUP_NUMBER)
.equals(State.PROCESSING.toString());
});
List<StateFile> stateFiles = new ArrayList<>();
for (String filename : filenames) {
stateFiles.add(new StateFile(filename));
}
return stateFiles;
}
/**
* Constructs a new {@link StateFile} from an existing file.
*
* @return a {@link StateFile} object derived from the specified file on disk
*/
public StateFile newStateFileFromDiskFile() {
return newStateFileFromDiskFile(false);
}
/**
* Constructs a new {@link StateFile} from an existing file.
*
* @param silent when true suppresses the logging message from matching the disk file.
* @return a StateFile object derived from the specified file on disk
*/
@AcceptableCatchBlock(rationale = Rationale.CAN_NEVER_OCCUR)
public StateFile newStateFileFromDiskFile(boolean silent) {
File stateFilePathToUse = StateFile.getDiskFileFromInvariantNamePart(this);
if (!silent) {
log.info("Matched statefile: " + stateFilePathToUse);
}
StateFile stateFile = new StateFile(stateFilePathToUse.getName());
try {
PropertiesConfiguration props = new Configurations().properties(stateFilePathToUse);
if (props.isEmpty()) {
throw new PipelineException("State file contains no properties!");
}
stateFile.props = props;
} catch (ConfigurationException e) {
// This can never occur. By construction, the props field is guaranteed
// to be constructed correctly.
throw new AssertionError(e);
}
return stateFile;
}
/**
* Searches a directory for a state file where the name's invariant part matches that provided
* by the caller.
*
* @param oldStateFile existing {@link StateFile} instance
* @return the file in the directory of the stateFilePath that has a name for which the
* invariant part matches the invariant part of the stateFilePath name (i.e.,
* ziggy.pa.6.23.QUEUED.10.0.0 will match ziggy.pa.6.23.* on disk)
* @exception IllegalArgumentException if the stateFilePath's parent is not a directory
* @exception IllegalStateException if there isn't only one StateFile that matches the invariant
* part of stateFilePath
*/
private static File getDiskFileFromInvariantNamePart(StateFile oldStateFile) {
// Sadly, this is probably the easiest way to do this.
String invariantPart = oldStateFile.invariantPart();
File directory = DirectoryProperties.stateFilesDir().toFile();
if (!directory.exists() || !directory.isDirectory()) {
throw new IllegalArgumentException(
"Specified directory does not exist or is not a directory: " + directory);
}
FileFilter fileFilter = new WildcardFileFilter(invariantPart + "*");
File[] matches = directory.listFiles(fileFilter);
if (matches == null) {
throw new IllegalStateException(
"No state file matching " + invariantPart + "* in directory " + directory);
}
if (matches.length > 1) {
throw new IllegalStateException("More than one state file matches " + invariantPart
+ "* in directory " + directory);
}
return matches[0];
}
/**
* Builds the name of the state file based on the elements.
*/
public String name() {
return invariantPart() + "." + state + "_" + numTotal + "-" + numComplete + "-" + numFailed;
}
/**
* Returns the invariant part of the state file name. This includes the static PREFIX and the
* pipeline instance and task ids, plus the module name.
*/
public String invariantPart() {
return PREFIX + pipelineInstanceId + "." + pipelineTaskId + "." + moduleName;
}
public static String invariantPart(PipelineTask task) {
return PREFIX + task.getPipelineInstanceId() + "." + task.getId() + "."
+ task.getModuleName();
}
public String taskBaseName() {
return PipelineTask.taskBaseName(pipelineInstanceId, pipelineTaskId, moduleName);
}
/**
* Returns the name of the task dir represented by this StateFile.
*/
public String taskDirName() {
return PipelineTask.taskBaseName(getPipelineInstanceId(), getPipelineTaskId(),
getModuleName());
}
public void setStateAndPersist(State state) {
LockManager.getWriteLockOrBlock(lockFile());
try {
StateFile oldState = newStateFileFromDiskFile(true);
StateFile newState = new StateFile(oldState);
newState.setState(state);
StateFile.updateStateFile(oldState, newState);
} finally {
LockManager.releaseWriteLock(lockFile());
}
}
public File lockFile() {
return DirectoryProperties.taskDataDir()
.resolve(taskDirName())
.resolve(LOCK_FILE_NAME)
.toFile();
}
public boolean isDone() {
return state == State.COMPLETE || state == State.CLOSED;
}
public boolean isRunning() {
return state == State.PROCESSING;
}
public boolean isQueued() {
return state == State.QUEUED;
}
public boolean isStarted() {
return isRunning() || isDone();
}
@Override
public int compareTo(StateFile o) {
return name().compareTo(o.name());
}
@Override
public String toString() {
return name();
}
public long getPipelineInstanceId() {
return pipelineInstanceId;
}
public void setPipelineInstanceId(long pipelineInstanceId) {
this.pipelineInstanceId = pipelineInstanceId;
}
public long getPipelineTaskId() {
return pipelineTaskId;
}
public void setPipelineTaskId(long pipelineTaskId) {
this.pipelineTaskId = pipelineTaskId;
}
public String getModuleName() {
return moduleName;
}
public void setModuleName(String moduleName) {
this.moduleName = moduleName;
}
public State getState() {
return state;
}
public void setState(State state) {
this.state = state;
}
public int getNumTotal() {
return numTotal;
}
public void setNumTotal(int numTotal) {
this.numTotal = numTotal;
}
public int getNumComplete() {
return numComplete;
}
public void setNumComplete(int numComplete) {
this.numComplete = numComplete;
}
public int getNumFailed() {
return numFailed;
}
public void setNumFailed(int numFailed) {
this.numFailed = numFailed;
}
public String getExecutableName() {
return props.getProperty(EXECUTABLE_NAME_PROP_NAME) != null
&& !StringUtils.isBlank(props.getString(EXECUTABLE_NAME_PROP_NAME))
? props.getString(EXECUTABLE_NAME_PROP_NAME)
: INVALID_STRING;
}
public void setExecutableName(String executableName) {
props.setProperty(EXECUTABLE_NAME_PROP_NAME, executableName);
}
/**
* Returns the value of the {@value #REMOTE_NODE_ARCHITECTURE_PROP_NAME} property, or
* {@link #DEFAULT_REMOTE_NODE_ARCHITECTURE} if not present or set.
*/
public String getRemoteNodeArchitecture() {
return props.getProperty(REMOTE_NODE_ARCHITECTURE_PROP_NAME) != null
&& !props.getString(REMOTE_NODE_ARCHITECTURE_PROP_NAME).isBlank()
? props.getString(REMOTE_NODE_ARCHITECTURE_PROP_NAME)
: DEFAULT_REMOTE_NODE_ARCHITECTURE;
}
public void setRemoteNodeArchitecture(String remoteNodeArchitecture) {
props.setProperty(REMOTE_NODE_ARCHITECTURE_PROP_NAME, remoteNodeArchitecture);
}
/**
* Returns the value of the {@value #MIN_CORES_PER_NODE_PROP_NAME} property, or 0 if not present
* or set.
*/
public int getMinCoresPerNode() {
return props.getProperty(MIN_CORES_PER_NODE_PROP_NAME) != null
? props.getInt(MIN_CORES_PER_NODE_PROP_NAME)
: INVALID_VALUE;
}
public void setMinCoresPerNode(int minCoresPerNode) {
props.setProperty(MIN_CORES_PER_NODE_PROP_NAME, minCoresPerNode);
}
/**
* Returns the value of the {@value #MIN_GIGS_PER_NODE_PROP_NAME} property, or 0 if not present
* or set.
*/
public double getMinGigsPerNode() {
return props.getProperty(MIN_GIGS_PER_NODE_PROP_NAME) != null
? props.getDouble(MIN_GIGS_PER_NODE_PROP_NAME)
: INVALID_VALUE;
}
public void setMinGigsPerNode(double minGigsPerNode) {
props.setProperty(MIN_GIGS_PER_NODE_PROP_NAME, minGigsPerNode);
}
/**
* Returns the value of the {@value #REQUESTED_WALLTIME_PROP_NAME} property, or
* {@link #DEFAULT_WALL_TIME} if not present or set.
*/
public String getRequestedWallTime() {
return props.getProperty(REQUESTED_WALLTIME_PROP_NAME) != null
? props.getString(REQUESTED_WALLTIME_PROP_NAME)
: DEFAULT_WALL_TIME;
}
public void setRequestedWallTime(String requestedWallTime) {
props.setProperty(REQUESTED_WALLTIME_PROP_NAME, requestedWallTime);
}
/**
* Returns the value of the {@value #REMOTE_GROUP_PROP_NAME} property, or
* {@link #INVALID_STRING} if not present or set.
*/
public String getRemoteGroup() {
return props.getProperty(REMOTE_GROUP_PROP_NAME) != null
? props.getString(REMOTE_GROUP_PROP_NAME)
: INVALID_STRING;
}
public void setRemoteGroup(String remoteGroup) {
props.setProperty(REMOTE_GROUP_PROP_NAME, remoteGroup);
}
/**
* Returns the value of the {@value #QUEUE_NAME_PROP_NAME} property, or {@link #INVALID_STRING}
* if not present or set.
*/
public String getQueueName() {
return props.getProperty(QUEUE_NAME_PROP_NAME) != null
? props.getString(QUEUE_NAME_PROP_NAME)
: INVALID_STRING;
}
public void setQueueName(String queueName) {
props.setProperty(QUEUE_NAME_PROP_NAME, queueName);
}
public int getActiveCoresPerNode() {
return props.getProperty(ACTIVE_CORES_PER_NODE_PROP_NAME) != null
? props.getInt(ACTIVE_CORES_PER_NODE_PROP_NAME)
: INVALID_VALUE;
}
public void setActiveCoresPerNode(int activeCoresPerNode) {
props.setProperty(ACTIVE_CORES_PER_NODE_PROP_NAME, activeCoresPerNode);
}
public int getRequestedNodeCount() {
return props.getProperty(REQUESTED_NODE_COUNT_PROP_NAME) != null
? props.getInt(REQUESTED_NODE_COUNT_PROP_NAME)
: INVALID_VALUE;
}
public void setRequestedNodeCount(int requestedNodeCount) {
props.setProperty(REQUESTED_NODE_COUNT_PROP_NAME, requestedNodeCount);
}
public double getGigsPerSubtask() {
return props.getProperty(GIGS_PER_SUBTASK_PROP_NAME) != null
? props.getDouble(GIGS_PER_SUBTASK_PROP_NAME)
: INVALID_VALUE;
}
public void setGigsPerSubtask(double gigsPerSubtask) {
props.setProperty(GIGS_PER_SUBTASK_PROP_NAME, gigsPerSubtask);
}
/**
* Returns the value of the {@value #PBS_SUBMIT_PROP_NAME} property, or {@link #INVALID_VALUE}
* if not present or set.
*/
public long getPbsSubmitTimeMillis() {
return props.getProperty(PBS_SUBMIT_PROP_NAME) != null ? props.getLong(PBS_SUBMIT_PROP_NAME)
: INVALID_VALUE;
}
public void setPbsSubmitTimeMillis(long pbsSubmitTimeMillis) {
props.setProperty(PBS_SUBMIT_PROP_NAME, pbsSubmitTimeMillis);
}
/**
* Returns the value of the {@value #PFE_ARRIVAL_PROP_NAME} property, or {@link #INVALID_VALUE}
* if not present or set.
*/
public long getPfeArrivalTimeMillis() {
return props.getProperty(PFE_ARRIVAL_PROP_NAME) != null
? props.getLong(PFE_ARRIVAL_PROP_NAME)
: INVALID_VALUE;
}
public void setPfeArrivalTimeMillis(long pfeArrivalTimeMillis) {
props.setProperty(PFE_ARRIVAL_PROP_NAME, pfeArrivalTimeMillis);
}
@Override
public int hashCode() {
return Objects.hash(moduleName, numComplete, numFailed, numTotal, pipelineInstanceId,
pipelineTaskId, state);
}
@Override
public boolean equals(Object obj) {
if (this == obj) {
return true;
}
if (obj == null || getClass() != obj.getClass()) {
return false;
}
StateFile other = (StateFile) obj;
if (moduleName == null) {
if (other.moduleName == null) {
return false;
}
} else if (!moduleName.equals(other.moduleName)) {
return false;
}
if (numComplete != other.numComplete || numFailed != other.numFailed
|| numTotal != other.numTotal || pipelineInstanceId != other.pipelineInstanceId) {
return false;
}
if (pipelineTaskId != other.pipelineTaskId || state != other.state) {
return false;
}
return true;
}
/**
* Used to construct a {@link StateFile} object. To use this class, a {@link StateFile} object
* is created and then non-null fields are set using the available builder methods. Finally, a
* {@link StateFile} object is created using the {@code build} method. For example:
*
* <pre>
* StateFile stateFile = new StateFile.Builder().foo(fozar).bar(bazar).build();
* </pre>
*
* This pattern is based upon
* <a href= "http://developers.sun.com/learning/javaoneonline/2006/coreplatform/TS-1512.pdf" >
* Josh Bloch's JavaOne 2006 talk, Effective Java Reloaded, TS-1512</a>.
*
* @author PT
*/
public static class Builder {
private StateFile stateFile = new StateFile();
public Builder() {
}
public Builder moduleName(String moduleName) {
stateFile.setModuleName(moduleName);
return this;
}
public Builder executableName(String executableName) {
stateFile.setExecutableName(executableName);
return this;
}
public Builder pipelineInstanceId(long pipelineInstanceId) {
stateFile.setPipelineInstanceId(pipelineInstanceId);
return this;
}
public Builder pipelineTaskId(long pipelineTaskId) {
stateFile.setPipelineTaskId(pipelineTaskId);
return this;
}
public Builder state(State state) {
stateFile.setState(state);
return this;
}
public Builder numTotal(int numTotal) {
stateFile.setNumTotal(numTotal);
return this;
}
public Builder numComplete(int numComplete) {
stateFile.setNumComplete(numComplete);
return this;
}
public Builder numFailed(int numFailed) {
stateFile.setNumFailed(numFailed);
return this;
}
public StateFile build() {
return new StateFile(stateFile);
}
}
}
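
A concrete use of the builder defined above, with purely illustrative values:

```java
StateFile stateFile = new StateFile.Builder().moduleName("pa")
    .executableName("pa")
    .pipelineInstanceId(5)
    .pipelineTaskId(42)
    .state(StateFile.State.QUEUED)
    .numTotal(10)
    .numComplete(0)
    .numFailed(0)
    .build();
// stateFile.name() -> "ziggy.5.42.pa.QUEUED_10-0-0"
```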

View File

@ -35,34 +35,34 @@ public class SubtaskAllocator {
}
}
public boolean markSubtaskComplete(int subTaskIndex) {
boolean found = markSubtaskNeedsNoFurtherProcessing(subTaskIndex);
subtaskCompleted[subTaskIndex] = true;
public boolean markSubtaskComplete(int subtaskIndex) {
boolean found = markSubtaskNeedsNoFurtherProcessing(subtaskIndex);
subtaskCompleted[subtaskIndex] = true;
return found;
}
public boolean markSubtaskLocked(int subTaskIndex) {
return markSubtaskNeedsNoFurtherProcessing(subTaskIndex);
public boolean markSubtaskLocked(int subtaskIndex) {
return markSubtaskNeedsNoFurtherProcessing(subtaskIndex);
}
private boolean markSubtaskNeedsNoFurtherProcessing(int subTaskIndex) {
private boolean markSubtaskNeedsNoFurtherProcessing(int subtaskIndex) {
boolean found = false;
for (int i = 0; i < currentPoolProcessing.size(); i++) {
if (currentPoolProcessing.get(i) == subTaskIndex) {
if (currentPoolProcessing.get(i) == subtaskIndex) {
found = true;
currentPoolProcessing.remove(i);
log.debug("removing subtaskIndex: " + subTaskIndex);
log.debug("Removing subtaskIndex {}", subtaskIndex);
}
}
if (!found) {
log.warn("failed to remove subtaskIndex: " + subTaskIndex);
log.warn("Failed to remove subtaskIndex {}", subtaskIndex);
return false;
}
return true;
}
/**
* Return the next sub-task available for processing
* Return the next subtask available for processing
*
* @return
*/

View File

@ -29,6 +29,9 @@ import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
public class SubtaskClient {
private static final Logger log = LoggerFactory.getLogger(SubtaskClient.class);
// How long should the client wait after a TRY_AGAIN response?
public static final long TRY_AGAIN_WAIT_TIME_MILLIS = 2000L;
// The SubtaskClient only handles one request / response at a time, hence the
// ArrayBlockingQueue needs only one entry.
private final ArrayBlockingQueue<Response> responseQueue = new ArrayBlockingQueue<>(1);
@ -38,22 +41,23 @@ public class SubtaskClient {
}
/**
* Client method to report that a sub-task has completed.
* Client method to report that a subtask has completed.
*/
public Response reportSubtaskComplete(int subTaskIndex) {
return request(RequestType.REPORT_DONE, subTaskIndex);
public Response reportSubtaskComplete(int subtaskIndex) {
return request(RequestType.REPORT_DONE, subtaskIndex);
}
/**
* Client method to report that a sub-task is locked by another compute node.
* Client method to report that a subtask is locked by another compute node.
*/
public Response reportSubtaskLocked(int subTaskIndex) {
return request(RequestType.REPORT_LOCKED, subTaskIndex);
public Response reportSubtaskLocked(int subtaskIndex) {
return request(RequestType.REPORT_LOCKED, subtaskIndex);
}
/**
* Get the next subtask for processing.
*/
@AcceptableCatchBlock(rationale = Rationale.CLEANUP_BEFORE_EXIT)
public Response nextSubtask() {
Response response = null;
@ -62,8 +66,13 @@ public class SubtaskClient {
if (response == null || response.status != ResponseType.TRY_AGAIN) {
break;
}
try {
Thread.sleep(tryAgainWaitTimeMillis());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
log.debug("getNextSubTask: Got a response: " + response);
log.debug("getNextSubtask: Got a response: {}", response);
return response;
}
@ -94,7 +103,7 @@ public class SubtaskClient {
*/
private Response request(RequestType command, int subtaskIndex) {
log.debug("Sending request " + command + " with subtaskIndex " + subtaskIndex);
log.debug("Sending request {} with subtaskIndex {}", command, subtaskIndex);
send(command, subtaskIndex);
return receive();
}
@ -103,9 +112,9 @@ public class SubtaskClient {
* Sends a request to the {@link SubtaskServer}. This is accomplished by creating a new instance
* of {@link Request}, which is then put onto the server's {@link ArrayBlockingQueue}.
*/
private void send(RequestType command, int subTaskIndex) {
private void send(RequestType command, int subtaskIndex) {
log.debug("Connected to subtask server, sending request");
Request request = new Request(command, subTaskIndex, this);
Request request = new Request(command, subtaskIndex, this);
SubtaskServer.submitRequest(request);
}
@ -124,4 +133,8 @@ public class SubtaskClient {
return null;
}
}
long tryAgainWaitTimeMillis() {
return TRY_AGAIN_WAIT_TIME_MILLIS;
}
}
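
As a hedged sketch of how a compute-node thread drives this client, here is a condensed version of the loop that SubtaskMaster (shown later in this diff) actually runs; the no-argument constructor and the omitted algorithm call are assumptions:

SubtaskClient subtaskClient = new SubtaskClient(); // assumes a no-argument constructor
while (true) {
    Response response = subtaskClient.nextSubtask();
    if (response == null || response.status.equals(ResponseType.NO_MORE)
        || !response.successful()) {
        break;
    }
    int subtaskIndex = response.subtaskIndex;
    // ... execute the algorithm for subtaskIndex (omitted) ...
    subtaskClient.reportSubtaskComplete(subtaskIndex);
}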

View File

@ -24,7 +24,7 @@ import gov.nasa.ziggy.module.SubtaskDirectoryIterator.GroupSubtaskDirectory;
*/
public class SubtaskDirectoryIterator implements Iterator<GroupSubtaskDirectory> {
private static final Logger log = LoggerFactory.getLogger(SubtaskDirectoryIterator.class);
private static final Pattern SUB_TASK_PATTERN = Pattern.compile("st-([0-9]+)");
private static final Pattern SUBTASK_PATTERN = Pattern.compile("st-([0-9]+)");
private final Iterator<GroupSubtaskDirectory> dirIterator;
private final LinkedList<GroupSubtaskDirectory> directoryList;
@ -33,8 +33,8 @@ public class SubtaskDirectoryIterator implements Iterator<GroupSubtaskDirectory>
public SubtaskDirectoryIterator(File taskDirectory) {
directoryList = new LinkedList<>();
buildDirectoryList(taskDirectory);
log.debug("Number of subtask directories detected in task directory "
+ taskDirectory.toString() + ": " + directoryList.size());
log.debug("Detected {} subtask directories in task directory {}", taskDirectory.toString(),
directoryList.size());
dirIterator = directoryList.iterator();
}
@ -45,7 +45,7 @@ public class SubtaskDirectoryIterator implements Iterator<GroupSubtaskDirectory>
File groupDir = file.getParentFile();
File subtaskDir = file;
directoryList.add(new GroupSubtaskDirectory(groupDir, subtaskDir));
log.debug("Adding: " + file);
log.debug("Adding {}", file);
}
}
@ -78,7 +78,7 @@ public class SubtaskDirectoryIterator implements Iterator<GroupSubtaskDirectory>
}
private int subtaskNumber(String name) {
Matcher matcher = SUB_TASK_PATTERN.matcher(name);
Matcher matcher = SUBTASK_PATTERN.matcher(name);
int number = -1;
if (matcher.matches()) {
number = Integer.parseInt(matcher.group(1));
@ -90,7 +90,7 @@ public class SubtaskDirectoryIterator implements Iterator<GroupSubtaskDirectory>
return currentIndex;
}
public int numSubTasks() {
public int numSubtasks() {
return directoryList.size();
}
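
A small, self-contained illustration of the st-N directory naming that SUBTASK_PATTERN parses above; the directory name is hypothetical:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubtaskPatternDemo {
    public static void main(String[] args) {
        // Same regex as SUBTASK_PATTERN; "st-17" is a made-up subtask directory name.
        Pattern subtaskPattern = Pattern.compile("st-([0-9]+)");
        Matcher matcher = subtaskPattern.matcher("st-17");
        int subtaskNumber = matcher.matches() ? Integer.parseInt(matcher.group(1)) : -1;
        System.out.println(subtaskNumber); // prints 17
    }
}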

View File

@ -43,7 +43,7 @@ import gov.nasa.ziggy.util.os.OperatingSystemType;
/**
* This class encapsulates the setup and execution algorithm code for a single subtask. The code is
* assumed to be runnable from the shell using the binary name for invocation. It also invokes the
* populateSubTaskInputs() method of the module's {@link PipelineInputs} subclass prior to running
* populateSubtaskInputs() method of the module's {@link PipelineInputs} subclass prior to running
* the algorithm, and invokes the populateTaskResults() method of the module's
* {@link PipelineOutputs} subclass subsequent to running the algorithm.
* <p>
@ -71,7 +71,7 @@ public class SubtaskExecutor {
private CommandLine commandLine;
private Map<String, String> environment = new HashMap<>();
private OperatingSystemType osType = OperatingSystemType.getInstance();
private OperatingSystemType osType = OperatingSystemType.newInstance();
private String libPath;
private String binPath;
@ -117,11 +117,11 @@ public class SubtaskExecutor {
String hostname = HostNameUtils.shortHostName();
log.info("osType = " + osType.toString());
log.info("hostname = " + hostname);
log.info("binaryDir = " + binaryDir);
log.info("binaryName = " + binaryName);
log.info("libPath = " + libPath);
log.info("osType = {}", osType.toString());
log.info("hostname = {}", hostname);
log.info("binaryDir = {}", binaryDir);
log.info("binaryName = {}", binaryName);
log.info("libPath = {}", libPath);
// Construct the environment
environment.put(MCR_CACHE_ROOT_ENV_VAR_NAME,
@ -153,7 +153,7 @@ public class SubtaskExecutor {
*/
private File binaryDir(String binPathString, String binaryName) {
File binFile = binaryDirInternal(binPathString, binaryName, null);
if (binFile == null && OperatingSystemType.getInstance() == OperatingSystemType.MAC_OS_X) {
if (binFile == null && OperatingSystemType.newInstance() == OperatingSystemType.MAC_OS_X) {
binFile = binaryDirInternal(binPathString, binaryName,
new String[] { binaryName + ".app", "Contents", "MacOS" });
}
@ -162,7 +162,7 @@ public class SubtaskExecutor {
private File binaryDirInternal(String binPathString, String binaryName, String[] pathSuffix) {
log.info("Searching for binary " + binaryName + " in path " + binPathString);
log.info("Searching for binary {} in path {}", binaryName, binPathString);
File binFile = null;
String[] binPaths = binPathString.split(File.pathSeparator);
for (String binPath : binPaths) {
@ -238,7 +238,7 @@ public class SubtaskExecutor {
}
sb.setLength(sb.length() - 2);
sb.append("]");
log.info("Execution environment: " + sb.toString());
log.info("Execution environment is {}", sb.toString());
}
/**
@ -265,16 +265,16 @@ public class SubtaskExecutor {
retCode = execAlgorithmInternal();
if (retCode != 0) {
log.warn("Marking subtask as failed because retCode = " + retCode);
log.warn("Marking subtask as failed (retCode={})", retCode);
markSubtaskFailed(workingDir);
}
if (errorFile.exists()) {
log.warn("Marking subtask as failed because an error file exists");
log.warn("Marking subtask as failed (error file exists)");
markSubtaskFailed(workingDir);
}
} catch (Exception e) {
log.warn("Marking subtask as failed because a Java-side exception occurred", e);
log.warn("Marking subtask as failed (Java-side exception occurred)", e);
markSubtaskFailed(workingDir);
}
return retCode;
@ -292,7 +292,7 @@ public class SubtaskExecutor {
}
AlgorithmStateFiles stateFile = new AlgorithmStateFiles(workingDir);
stateFile.updateCurrentState(AlgorithmStateFiles.SubtaskState.PROCESSING);
stateFile.updateCurrentState(AlgorithmStateFiles.AlgorithmState.PROCESSING);
boolean inputsProcessingSucceeded = false;
boolean algorithmProcessingSucceeded = false;
@ -320,7 +320,7 @@ public class SubtaskExecutor {
File errorFile = ModuleInterfaceUtils.errorFile(workingDir, binaryName);
if (retCode == 0 && !errorFile.exists()) {
stateFile.updateCurrentState(AlgorithmStateFiles.SubtaskState.COMPLETE);
stateFile.updateCurrentState(AlgorithmStateFiles.AlgorithmState.COMPLETE);
} else {
/*
* Don't handle an error in processing at this point in execution. Instead, allow the
@ -328,15 +328,15 @@ public class SubtaskExecutor {
* level, after some error-management tasks have been completed.
*/
stateFile.updateCurrentState(AlgorithmStateFiles.SubtaskState.FAILED);
stateFile.updateCurrentState(AlgorithmStateFiles.AlgorithmState.FAILED);
if (retCode != 0) {
if (!inputsProcessingSucceeded) {
log.error("failed to generate sub-task inputs, retCode = " + retCode);
log.error("Failed to generate subtask inputs (retCode={})", retCode);
} else if (algorithmProcessingSucceeded) {
log.error("failed to generate task results, retCode = " + retCode);
log.error("Failed to generate task results (retCode={})", retCode);
}
} else {
log.info("Algorithm process completed, retCode=" + retCode);
log.info("Algorithm process completed (retCode={})", retCode);
}
}
@ -349,7 +349,7 @@ public class SubtaskExecutor {
*/
public int execSimple(List<String> commandLineArgs) {
int retCode = runCommandline(commandLineArgs, binaryName);
log.info("execSimple: retCode = " + retCode);
log.info("retCode={}", retCode);
return retCode;
}
@ -406,7 +406,7 @@ public class SubtaskExecutor {
throw new UncheckedIOException("Unable to get process environment ", e);
}
log.info("Executing command: " + commandLine.toString());
log.info("Executing command {}", commandLine.toString());
return externalProcess.execute();
}
@ -423,19 +423,19 @@ public class SubtaskExecutor {
ZiggyFileUtils.ZIGGY_CHARSET)) {
File binary = new File(binaryDir.getPath(), binaryName);
if ((!binary.exists() || !binary.isFile())
&& OperatingSystemType.getInstance() == OperatingSystemType.MAC_OS_X) {
&& OperatingSystemType.newInstance() == OperatingSystemType.MAC_OS_X) {
binary = new File(binaryDir.getPath(),
binaryName + ".app/Contents/MacOS/" + binaryName);
}
log.info("executing " + binary);
log.info("binary={}", binary);
commandLine = new CommandLine(binary.getCanonicalPath());
for (String element : commandline) {
commandLine.addArgument(element);
}
log.info("CommandLine: " + commandLine);
log.info("commandLine={}", commandLine);
Map<String, String> env = EnvironmentUtils.getProcEnvironment();
@ -480,7 +480,7 @@ public class SubtaskExecutor {
externalProcess.timeout(timeoutSecs * 1000);
externalProcess.setCommandLine(commandLine);
log.info("env = " + env);
log.info("env={}", env);
retCode = externalProcess.execute();
} finally {
IntervalMetric.stop("pipeline.module.externalProcess." + binaryName + ".execTime",
@ -502,9 +502,9 @@ public class SubtaskExecutor {
}
private static void markSubtaskFailed(File workingDir) {
AlgorithmStateFiles subTaskState = new AlgorithmStateFiles(workingDir);
if (subTaskState.currentSubtaskState() != AlgorithmStateFiles.SubtaskState.FAILED) {
subTaskState.updateCurrentState(AlgorithmStateFiles.SubtaskState.FAILED);
AlgorithmStateFiles subtaskState = new AlgorithmStateFiles(workingDir);
if (subtaskState.currentAlgorithmState() != AlgorithmStateFiles.AlgorithmState.FAILED) {
subtaskState.updateCurrentState(AlgorithmStateFiles.AlgorithmState.FAILED);
}
}
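
SubtaskExecutor builds a commons-exec CommandLine from the resolved binary path and hands it to Ziggy's ExternalProcess wrapper; as a rough sketch of the underlying org.apache.commons.exec pattern only (the binary path and argument are invented, and this bypasses ExternalProcess entirely):

import java.io.IOException;
import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;

public class CommandLineSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical binary and argument; DefaultExecutor.execute() throws
        // ExecuteException (an IOException) if the process exits non-zero.
        CommandLine commandLine = new CommandLine("/path/to/mymodule");
        commandLine.addArgument("st-0");
        int retCode = new DefaultExecutor().execute(commandLine);
        System.out.println("retCode=" + retCode);
    }
}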

View File

@ -5,7 +5,7 @@ import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.Objects;
import java.util.concurrent.Semaphore;
import java.util.concurrent.CountDownLatch;
import org.apache.commons.exec.DefaultExecutor;
import org.apache.commons.lang3.StringUtils;
@ -16,7 +16,6 @@ import gov.nasa.ziggy.module.SubtaskServer.ResponseType;
import gov.nasa.ziggy.module.hdf5.Hdf5ModuleInterface;
import gov.nasa.ziggy.module.io.AlgorithmErrorReturn;
import gov.nasa.ziggy.module.io.ModuleInterfaceUtils;
import gov.nasa.ziggy.module.remote.TimestampFile;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
import gov.nasa.ziggy.util.io.LockManager;
@ -44,18 +43,18 @@ public class SubtaskMaster implements Runnable {
int threadNumber = -1;
private final String node;
private final Semaphore complete;
private final CountDownLatch countdownLatch;
private final String binaryName;
private final String taskDir;
private final int timeoutSecs;
private final String jobId;
private final String jobName;
public SubtaskMaster(int threadNumber, String node, Semaphore complete, String binaryName,
String taskDir, int timeoutSecs) {
public SubtaskMaster(int threadNumber, String node, CountDownLatch countdownLatch,
String binaryName, String taskDir, int timeoutSecs) {
this.threadNumber = threadNumber;
this.node = node;
this.complete = complete;
this.countdownLatch = countdownLatch;
this.binaryName = binaryName;
this.taskDir = taskDir;
this.timeoutSecs = timeoutSecs;
@ -64,8 +63,7 @@ public class SubtaskMaster implements Runnable {
if (!StringUtils.isBlank(fullJobId)) {
jobId = fullJobId.split("\\.")[0];
jobName = System.getenv("PBS_JOBNAME");
log.info(
"job ID: " + jobId + ", job name: " + jobName + ", thread number: " + threadNumber);
log.info("jobId={}, jobName={}, threadNumber={}", jobId, jobName, threadNumber);
} else {
jobId = "none";
jobName = "none";
@ -77,12 +75,12 @@ public class SubtaskMaster implements Runnable {
public void run() {
try {
processSubtasks();
log.info("Node: " + node + "[" + threadNumber
+ "]: No more subtasks to process, thread exiting");
log.info("Node: {}[{}]: No more subtasks to process, thread exiting", node,
threadNumber);
} catch (Exception e) {
log.error("Exception thrown in SubtaskMaster", e);
} finally {
complete.release();
countdownLatch.countDown();
}
}
@ -103,7 +101,7 @@ public class SubtaskMaster implements Runnable {
response = subtaskClient.nextSubtask();
if (response == null) {
log.error("Null response from SubtaskClient, exiting.");
log.error("Null response from SubtaskClient, exiting");
break;
}
if (response.status.equals(ResponseType.NO_MORE)) {
@ -111,13 +109,13 @@ public class SubtaskMaster implements Runnable {
break;
}
if (!response.successful()) {
log.error("Unsuccessful response from SubtaskClient, exiting.");
log.error("Response content: {}", response.toString());
log.error("Unsuccessful response from SubtaskClient, exiting");
log.error("Response is {}", response.toString());
break;
}
subtaskIndex = response.subtaskIndex;
log.debug(threadNumber + ": Processing sub-task: " + subtaskIndex);
log.debug("threadNumber={}, subtaskIndex={}", threadNumber, subtaskIndex);
File subtaskDir = SubtaskUtils.subtaskDirectory(Paths.get(taskDir), subtaskIndex)
.toFile();
@ -129,8 +127,8 @@ public class SubtaskMaster implements Runnable {
SubtaskUtils.putLogStreamIdentifier(subtaskDir);
if (!checkSubtaskState(subtaskDir)) {
executeSubtask(subtaskDir, threadNumber, subtaskIndex);
subtaskClient.reportSubtaskComplete(subtaskIndex);
}
subtaskClient.reportSubtaskComplete(subtaskIndex);
} else {
subtaskClient.reportSubtaskLocked(subtaskIndex);
}
@ -141,6 +139,10 @@ public class SubtaskMaster implements Runnable {
// here to prevent same. The higher-level monitoring will manage any
// cases in which a subtask's processing failed.
logException(subtaskIndex, e);
// Also, tell the server and allocator not to bother trying again with
// this subtask.
subtaskClient.reportSubtaskComplete(subtaskIndex);
} finally {
SubtaskUtils.putLogStreamIdentifier((String) null);
if (lockFileObtained) {
@ -171,31 +173,30 @@ public class SubtaskMaster implements Runnable {
private boolean checkSubtaskState(File subtaskDir) {
AlgorithmStateFiles previousAlgorithmState = algorithmStateFiles(subtaskDir);
if (!previousAlgorithmState.subtaskStateExists()) {
if (!previousAlgorithmState.stateExists()) {
// no previous run exists
log.info("No previous algorithm state file found in " + subtaskDir.getName()
+ ", executing this subtask");
log.info("No previous algorithm state file found in {}, executing this subtask",
subtaskDir.getName());
return false;
}
if (previousAlgorithmState.isComplete()) {
log.info("subtask algorithm state = COMPLETE, skipping subtask" + subtaskDir.getName());
log.info("Subtask algorithm state COMPLETE, skipping subtask{}", subtaskDir.getName());
return true;
}
if (previousAlgorithmState.isFailed()) {
log.info(".FAILED state detected in directory " + subtaskDir.getName());
log.info(".FAILED state detected in directory {}", subtaskDir.getName());
return true;
}
if (previousAlgorithmState.isProcessing()) {
log.info(".PROCESSING state detected in directory " + subtaskDir.getName());
log.info(".PROCESSING state detected in directory {}", subtaskDir.getName());
return true;
}
log.info(
"Unexpected subtask algorithm state = " + previousAlgorithmState.currentSubtaskState()
+ ", restarting subtask " + subtaskDir.getName());
log.info("Unexpected subtask algorithm state {}, restarting subtask {}",
previousAlgorithmState.currentAlgorithmState(), subtaskDir.getName());
return false;
}
@ -212,18 +213,18 @@ public class SubtaskMaster implements Runnable {
+ ".node." + node;
try {
new File(subtaskDir, jobInfoFileName).createNewFile();
TimestampFile.create(subtaskDir, TimestampFile.Event.SUB_TASK_START);
TimestampFile.create(subtaskDir, TimestampFile.Event.SUBTASK_START);
SubtaskExecutor subtaskExecutor = subtaskExecutor(subtaskIndex);
log.info("START subtask: " + subtaskIndex + " on " + node + "[" + threadNumber + "]");
log.info("START subtask {} on {}[{}]", subtaskIndex, node, threadNumber);
retCode = subtaskExecutor.execAlgorithm();
log.info("FINISH subtask " + subtaskIndex + " on " + node + ", rc: " + retCode);
log.info("FINISH subtask {} on {} (retCode={})", subtaskIndex, node, retCode);
} catch (IOException e) {
throw new UncheckedIOException(
"Unable to create file " + new File(subtaskDir, jobInfoFileName).toString(), e);
} finally {
TimestampFile.create(subtaskDir, TimestampFile.Event.SUB_TASK_FINISH);
TimestampFile.create(subtaskDir, TimestampFile.Event.SUBTASK_FINISH);
}
if (retCode != 0) {
@ -269,7 +270,7 @@ public class SubtaskMaster implements Runnable {
* support testing.
*/
void logException(int subtaskIndex, Exception e) {
log.error("Error occurred during processing of subtask " + subtaskIndex, e);
log.error("Error occurred during processing of subtask {}", subtaskIndex, e);
}
/** Writes the algorithm stack trace, if any, to the algorithm log. */
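
The switch from Semaphore to CountDownLatch above supports the usual "wait until every SubtaskMaster thread has exited" pattern; a minimal, self-contained sketch (the thread count and the awaiting side are assumptions, since the ComputeNodeMaster code is not shown in this diff):

import java.util.concurrent.CountDownLatch;

public class CountDownLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        int threads = 4; // hypothetical number of SubtaskMaster threads
        CountDownLatch countdownLatch = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try {
                    // ... process subtasks ...
                } finally {
                    countdownLatch.countDown(); // as in SubtaskMaster.run()
                }
            }).start();
        }
        countdownLatch.await(); // returns once every thread has counted down
        System.out.println("All threads finished");
    }
}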

View File

@ -10,7 +10,7 @@ import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
/**
* Serves sub-tasks to clients using {@link SubtaskAllocator}. Clients should use
* Serves subtasks to clients using {@link SubtaskAllocator}. Clients should use
* {@link SubtaskClient} to communicate with an instance of this class.
*
* @author Todd Klaus
@ -135,9 +135,9 @@ public class SubtaskServer implements Runnable {
this.status = status;
}
public Response(ResponseType status, int subTaskIndex) {
public Response(ResponseType status, int subtaskIndex) {
this.status = status;
subtaskIndex = subTaskIndex;
this.subtaskIndex = subtaskIndex;
}
public boolean successful() {
@ -149,7 +149,7 @@ public class SubtaskServer implements Runnable {
StringBuilder sb = new StringBuilder();
sb.append("Response [status=");
sb.append(status);
sb.append(", subTaskIndex=");
sb.append(", subtaskIndex=");
sb.append(subtaskIndex);
sb.append("]");
@ -176,7 +176,7 @@ public class SubtaskServer implements Runnable {
// Retrieve the next request, or block until one is provided.
Request request = requestQueue.take();
log.debug("listen[server,before]: request: " + request);
log.debug("listen[server,before] request={}", request);
Response response = null;
@ -185,7 +185,7 @@ public class SubtaskServer implements Runnable {
if (type == RequestType.GET_NEXT) {
SubtaskAllocation nextSubtask = subtaskAllocator().nextSubtask();
log.debug("Allocated: " + nextSubtask);
log.debug("Allocated {}", nextSubtask);
ResponseType status = nextSubtask.getStatus();
int subtaskIndex = nextSubtask.getSubtaskIndex();
@ -201,10 +201,10 @@ public class SubtaskServer implements Runnable {
subtaskAllocator().markSubtaskLocked(request.subtaskIndex);
response = new Response(ResponseType.OK);
} else {
log.error("Unknown command: " + type);
log.error("Unknown command {}", type);
}
log.debug("listen[server,after], response: " + response);
log.debug("listen[server,after] response={}", response);
// Send the response back to the client.
request.client.submitResponse(response);

View File

@ -68,6 +68,8 @@ public class SubtaskUtils {
}
public static void clearStaleAlgorithmStates(File taskDir) {
log.info("Removing stale PROCESSING state from task directory");
new AlgorithmStateFiles(taskDir).clearStaleState();
log.info("Finding and clearing stale PROCESSING or FAILED subtask states");
SubtaskDirectoryIterator it = new SubtaskDirectoryIterator(taskDir);
while (it.hasNext()) {

View File

@ -24,6 +24,7 @@ public class TaskDirectoryManager {
private final Path taskDataDir;
private final PipelineTask pipelineTask;
private Path taskDir;
public TaskDirectoryManager(PipelineTask pipelineTask) {
taskDataDir = DirectoryProperties.taskDataDir();
@ -31,23 +32,21 @@ public class TaskDirectoryManager {
}
public Path taskDir() {
return taskDir(pipelineTask.taskBaseName());
}
private Path taskDir(String taskBaseName) {
return taskDataDir.resolve(taskBaseName);
if (taskDir == null) {
taskDir = taskDataDir.resolve(pipelineTask.taskBaseName());
}
return taskDir;
}
@AcceptableCatchBlock(rationale = Rationale.EXCEPTION_CHAIN)
public synchronized Path allocateTaskDir(boolean cleanExisting) {
if (Files.isDirectory(taskDir()) && cleanExisting) {
log.info(
"Working directory for name=" + pipelineTask.getId() + " already exists, deleting");
log.info("Working directory for task {} already exists, deleting", pipelineTask);
ZiggyFileUtils.deleteDirectoryTree(taskDir());
}
log.info("Creating task working dir: " + taskDir().toString());
log.info("Creating task working dir {}", taskDir().toString());
try {
Files.createDirectories(taskDir());
} catch (IOException e) {

View File

@ -3,58 +3,151 @@ package gov.nasa.ziggy.module;
import java.io.File;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.module.AlgorithmStateFiles.SubtaskState;
import gov.nasa.ziggy.module.AlgorithmStateFiles.AlgorithmState;
import gov.nasa.ziggy.module.AlgorithmStateFiles.SubtaskStateCounts;
import gov.nasa.ziggy.module.StateFile.State;
import gov.nasa.ziggy.util.io.LockManager;
import gov.nasa.ziggy.module.TimestampFile.Event;
import gov.nasa.ziggy.pipeline.definition.PipelineTask;
import gov.nasa.ziggy.pipeline.definition.ProcessingStep;
import gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDataOperations;
import gov.nasa.ziggy.services.messages.AllJobsFinishedMessage;
import gov.nasa.ziggy.services.messages.HaltTasksRequest;
import gov.nasa.ziggy.services.messages.TaskProcessingCompleteMessage;
import gov.nasa.ziggy.services.messages.WorkerStatusMessage;
import gov.nasa.ziggy.services.messaging.ZiggyMessenger;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
import gov.nasa.ziggy.util.ZiggyShutdownHook;
import gov.nasa.ziggy.util.ZiggyUtils;
/**
* Provides tools to manage a task's state file.
* <p>
* The main function of the class is to, upon request, walk through the subtask directories for a
* given task and count the number of failed and completed subtasks. This information is used in two
* ways:
* <ol>
* <li>It allows the compute nodes to determine whether all subtasks are through with processing,
* which is important to execution decisions made by {@link ComputeNodeMaster}.
* <li>It allows the state file's subtask counts to be updated, and if necessary it allows the state
* to be updated to reflect that errors have occurred.
* </ol>
* In addition, once the {@link ComputeNodeMaster} has completed its post-processing, the class
* allows the state file to be marked as complete.
* given task and count the number of failed and completed subtasks. This allows the subtask counts
* in the database to be updated.
* <p>
* The TaskMonitor also detects whether all subtasks for a task are either completed or failed, and
* notifies the {@link AlgorithmMonitor} of this fact via an instance of
* {@link TaskProcessingCompleteMessage}. Conversely, the TaskMonitor also responds to
* {@link AllJobsFinishedMessage} instances sent from the {@link AlgorithmMonitor} by shutting down
* the monitoring for the given task and performing a final count of subtask states. The
* {@link AllJobsFinishedMessage} means that all remote jobs for the task have finished or been
* deleted, ergo no further processing will occur regardless of how many subtasks have not yet been
* processed.
*
* @author PT
* @author Bill Wohler
*/
public class TaskMonitor {
public class TaskMonitor implements Runnable {
private static final Logger log = LoggerFactory.getLogger(TaskMonitor.class);
private final StateFile stateFile;
private final File taskDir;
private final File lockFile;
private final List<Path> subtaskDirectories;
private static final long PROCESSING_MESSAGE_MAX_WAIT_MILLIS = 5000L;
private static final long FILE_SYSTEM_LAG_DELAY_MILLIS = 5000L;
private static final long FILE_SYSTEM_CHECK_INTERVAL_MILLIS = 100L;
private static final int FILE_SYSTEM_CHECKS_COUNT = (int) (FILE_SYSTEM_LAG_DELAY_MILLIS
/ FILE_SYSTEM_CHECK_INTERVAL_MILLIS);
public TaskMonitor(StateFile stateFile, File taskDir) {
private final File taskDir;
private final List<Path> subtaskDirectories;
private final ScheduledThreadPoolExecutor monitoringThread = new ScheduledThreadPoolExecutor(1);
final long pollIntervalMilliseconds;
final AlgorithmStateFiles taskAlgorithmStateFile;
private final PipelineTaskDataOperations pipelineTaskDataOperations = new PipelineTaskDataOperations();
private final PipelineTask pipelineTask;
private int totalSubtasks;
private boolean monitoringEnabled = true;
private boolean finishFileDetected;
public TaskMonitor(PipelineTask pipelineTask, File taskDir, long pollIntervalMilliseconds) {
subtaskDirectories = SubtaskUtils.subtaskDirectories(taskDir.toPath());
this.stateFile = stateFile;
this.taskDir = taskDir;
lockFile = new File(taskDir, StateFile.LOCK_FILE_NAME);
this.pollIntervalMilliseconds = pollIntervalMilliseconds;
taskAlgorithmStateFile = new AlgorithmStateFiles(taskDir);
this.pipelineTask = pipelineTask;
}
public void startMonitoring() {
if (pollIntervalMilliseconds > 0) {
monitoringThread.scheduleWithFixedDelay(this, 0, pollIntervalMilliseconds,
TimeUnit.MILLISECONDS);
ZiggyShutdownHook.addShutdownHook(() -> {
monitoringThread.shutdownNow();
});
}
// If the worker for this task has sent a final message, perform a final
// update and shut down the monitoring thread.
ZiggyMessenger.subscribe(WorkerStatusMessage.class, message -> {
handleWorkerStatusMessage(message);
});
// If all remote jobs for this task have exited, perform a final update and
// shut down the monitoring thread.
ZiggyMessenger.subscribe(AllJobsFinishedMessage.class, message -> {
handleAllJobsFinishedMessage(message);
});
// If the task has been halted, perform a final update and shut down the
// monitoring thread.
ZiggyMessenger.subscribe(HaltTasksRequest.class, message -> {
handleHaltTasksRequest(message);
});
}
@Override
public void run() {
update();
}
void handleWorkerStatusMessage(WorkerStatusMessage message) {
// We only care about the worker message if we're doing local processing; otherwise, the
// worker's exit is irrelevant because processing has already been handed over to PBS.
if (message.isLastMessageFromWorker() && message.getPipelineTask().equals(pipelineTask)
&& pipelineTaskDataOperations.algorithmType(pipelineTask) == AlgorithmType.LOCAL) {
update(true);
}
}
void handleAllJobsFinishedMessage(AllJobsFinishedMessage message) {
if (!pipelineTask.equals(message.getPipelineTask())) {
return;
}
update(true);
}
void handleHaltTasksRequest(HaltTasksRequest request) {
if (request.getPipelineTasks().contains(pipelineTask)) {
update(true);
}
}
private SubtaskStateCounts countSubtaskStates() {
// If this is the first time we're counting states, make sure that the total subtask
// count is set correctly.
if (totalSubtasks == 0) {
totalSubtasks = pipelineTaskDataOperations.subtaskCounts(pipelineTask)
.getTotalSubtaskCount();
}
SubtaskStateCounts stateCounts = new SubtaskStateCounts();
if (subtaskDirectories.isEmpty()) {
log.warn("No subtask directories found in: " + taskDir);
log.warn("No subtask directories found in {}", taskDir);
}
for (Path subtaskDir : subtaskDirectories) {
AlgorithmStateFiles currentSubtaskStateFile = new AlgorithmStateFiles(
subtaskDir.toFile());
SubtaskState currentSubtaskState = currentSubtaskStateFile.currentSubtaskState();
AlgorithmState currentSubtaskState = currentSubtaskStateFile.currentAlgorithmState();
if (currentSubtaskState == null) {
// no algorithm state file exists yet
@ -66,6 +159,10 @@ public class TaskMonitor {
return stateCounts;
}
public boolean allSubtasksProcessed() {
return allSubtasksProcessed(countSubtaskStates());
}
/**
* Determines whether all subtasks have been processed: specifically, this means that all the
* subtasks are in either the completed or failed states, and none are currently processing or
@ -73,82 +170,142 @@ public class TaskMonitor {
*
* @return true if all subtasks have been processed.
*/
public boolean allSubtasksProcessed() {
public boolean allSubtasksProcessed(SubtaskStateCounts stateCounts) {
return stateCounts.getCompletedSubtasks()
+ stateCounts.getFailedSubtasks() == totalSubtasks;
}
/**
* Makes a single pass through all of the subtask directories and updates the database based on
* the {@link AlgorithmStateFiles} instances. If the task has started processing, the update
* will detect the .PROCESSING file in the task directory and set the task step to EXECUTING.
*/
public void update() {
update(false);
}
/**
* Performs a task status update. If argument finalUpdate is true, the {@link TaskMonitor}
* performs an orderly shutdown of monitoring; this occurs in response to messages received by
* the task monitor, as all such messages signal to the monitor that processing has ended. The
* method is synchronized in order to prevent the monitoring loop and the message-induced update
* from interfering with one another.
*/
private synchronized void update(boolean finalUpdate) {
// If we're no longer monitoring, it means we don't need this update.
if (!monitoringEnabled) {
return;
}
if (subtaskDirectories.isEmpty()) {
log.warn("No subtask dirs found in {}", taskDir);
}
SubtaskStateCounts stateCounts = countSubtaskStates();
return stateCounts.getCompletedSubtasks() + stateCounts.getFailedSubtasks() == stateFile
.getNumTotal();
pipelineTaskDataOperations.updateSubtaskCounts(pipelineTask, -1,
stateCounts.getCompletedSubtasks(), stateCounts.getFailedSubtasks());
if (taskAlgorithmStateFile.isProcessing()
&& pipelineTaskDataOperations.processingStep(pipelineTask).isPreExecutionStep()) {
pipelineTaskDataOperations.updateProcessingStep(pipelineTask, ProcessingStep.EXECUTING);
}
boolean allSubtasksProcessed = allSubtasksProcessed(stateCounts);
// If this was a run-of-the-mill update, we're done.
if (!allSubtasksProcessed && !finalUpdate) {
return;
}
if (finalUpdate) {
log.debug("Final update for task {}", pipelineTask.getId());
}
if (allSubtasksProcessed) {
log.debug("All subtasks processed for task {}", pipelineTask.getId());
}
// If we got this far, then all subsequent calls to this method should return
// without taking any action.
monitoringEnabled = false;
checkForFinishFile();
publishTaskProcessingCompleteMessage(allSubtasksProcessed ? new CountDownLatch(1) : null);
// It is now safe to shut down the monitoring loop.
shutdown();
}
/**
* Makes a single pass through all of the subtask directories and updates the {@link StateFile}
* based on the {@link AlgorithmStateFiles}s. This method does not update the status to COMPLETE
* when all subtasks are done to allow the caller to do any post processing before the state
* file is updated. The state file should be marked COMPLETE with the markStateFileDone()
* method.
* Checks for the FINISH timestamp file in the task directory. This is created by the
* {@link ComputeNodeMaster} when it exits and is needed by
* {@link ExternalProcessPipelineModule} when it persists results. Because the subtask
* .COMPLETED files can appear before the compute node FINISH file, and because there are file
* system lags on network file systems, we need to do a repetitive check for the file rather
* than just a one-and-done.
*/
public void updateState() {
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
private void checkForFinishFile() {
try {
LockManager.getWriteLockOrBlock(lockFile);
StateFile diskStateFile = stateFile.newStateFileFromDiskFile(true);
if (subtaskDirectories.isEmpty()) {
log.warn("No subtask dirs found in: " + taskDir);
}
SubtaskStateCounts stateCounts = countSubtaskStates();
stateFile.setNumComplete(stateCounts.getCompletedSubtasks());
stateFile.setNumFailed(stateCounts.getFailedSubtasks());
stateFile.setState(diskStateFile.getState());
// If for some reason this state hasn't been upgraded to PROCESSING,
// do that now.
stateFile
.setState(diskStateFile.isStarted() ? diskStateFile.getState() : State.PROCESSING);
updateStateFile(diskStateFile);
} finally {
LockManager.releaseWriteLock(lockFile);
ZiggyUtils.tryPatiently("Wait for FINISH file", fileSystemChecksCount(),
fileSystemCheckIntervalMillis(), () -> {
if (!TimestampFile.exists(taskDir, Event.FINISH)) {
throw new Exception();
}
finishFileDetected = true;
return null;
});
} catch (PipelineException e) {
log.error("FINISH file never created in task directory {}", taskDir.toString());
finishFileDetected = false;
}
}
/**
* Move the {@link StateFile} into the completed state if all subtasks are complete, or into the
* failed state if some subtasks failed or were never processed.
*/
public void markStateFileDone() {
try {
LockManager.getWriteLockOrBlock(lockFile);
StateFile previousStateFile = stateFile.newStateFileFromDiskFile();
if (stateFile.getNumComplete() + stateFile.getNumFailed() == stateFile.getNumTotal()) {
log.info("All subtasks complete or errored, marking state file COMPLETE");
} else {
// If there is a shortfall, consider the missing sub-tasks failed
int missing = stateFile.getNumTotal()
- (stateFile.getNumComplete() + stateFile.getNumFailed());
log.info("Missing subtasks, forcing state to FAILED, missing=" + missing);
stateFile.setNumFailed(stateFile.getNumFailed() + missing);
}
stateFile.setState(StateFile.State.COMPLETE);
updateStateFile(previousStateFile);
} finally {
LockManager.releaseWriteLock(lockFile);
}
public void shutdown() {
monitoringThread.shutdown();
}
private void updateStateFile(StateFile previousStateFile) {
if (!previousStateFile.equals(stateFile)) {
log.info("Updating state: " + previousStateFile + " -> " + stateFile);
if (!StateFile.updateStateFile(previousStateFile, stateFile)) {
log.error("Failed to update state file: " + previousStateFile);
void publishTaskProcessingCompleteMessage(CountDownLatch processingCompleteMessageLatch) {
ZiggyMessenger.publish(new TaskProcessingCompleteMessage(pipelineTask), false,
processingCompleteMessageLatch);
if (processingCompleteMessageLatch != null) {
try {
processingCompleteMessageLatch.await(PROCESSING_MESSAGE_MAX_WAIT_MILLIS,
TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
public StateFile getStateFile() {
return stateFile;
List<Path> getSubtaskDirectories() {
return subtaskDirectories;
}
Path getTaskDir() {
return taskDir.toPath();
}
AlgorithmStateFiles getTaskAlgorithmStateFile() {
return taskAlgorithmStateFile;
}
long fileSystemCheckIntervalMillis() {
return FILE_SYSTEM_CHECK_INTERVAL_MILLIS;
}
int fileSystemChecksCount() {
return FILE_SYSTEM_CHECKS_COUNT;
}
boolean isFinishFileDetected() {
return finishFileDetected;
}
void resetFinishFileDetection() {
finishFileDetected = false;
}
void resetMonitoringEnabled() {
monitoringEnabled = true;
}
}
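
The polling half of the new TaskMonitor is plain java.util.concurrent scheduling; a stripped-down sketch of the scheduleWithFixedDelay pattern used by startMonitoring() (the interval and the printed message are placeholders):

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PollingSketch {
    public static void main(String[] args) {
        long pollIntervalMilliseconds = 1000L; // hypothetical poll interval
        ScheduledThreadPoolExecutor monitoringThread = new ScheduledThreadPoolExecutor(1);
        // Re-runs the task at a fixed delay, as TaskMonitor does with its own run() method.
        monitoringThread.scheduleWithFixedDelay(
            () -> System.out.println("counting subtask state files..."), 0,
            pollIntervalMilliseconds, TimeUnit.MILLISECONDS);
    }
}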

View File

@ -1,15 +1,18 @@
package gov.nasa.ziggy.module.remote;
package gov.nasa.ziggy.module;
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.module.PipelineException;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
import gov.nasa.ziggy.util.io.ZiggyFileUtils;
/**
* @author Todd Klaus
@ -18,7 +21,7 @@ public abstract class TimestampFile {
private static final Logger log = LoggerFactory.getLogger(TimestampFile.class);
public enum Event {
ARRIVE_PFE, QUEUED_PBS, PBS_JOB_START, PBS_JOB_FINISH, SUB_TASK_START, SUB_TASK_FINISH
ARRIVE_COMPUTE_NODES, QUEUED, START, FINISH, SUBTASK_START, SUBTASK_FINISH
}
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
@ -41,20 +44,32 @@ public abstract class TimestampFile {
}
}
public static boolean delete(File directory, Event name) {
// delete any existing files with this prefix
String prefix = name.toString();
public static boolean exists(File directory, Event name) {
return !find(directory, name).isEmpty();
}
File[] files = directory.listFiles();
for (File file : files) {
if (file.getName().startsWith(prefix)) {
boolean deleted = file.delete();
public static Set<Path> find(File directory, Event name) {
return ZiggyFileUtils.listFiles(directory.toPath(), pattern(name));
}
private static String pattern(Event name) {
return name.toString() + "\\.[0-9]+";
}
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
public static boolean delete(File directory, Event name) {
Set<Path> files = find(directory, name);
for (Path file : files) {
try {
boolean deleted = Files.deleteIfExists(file);
if (!deleted) {
log.warn(
String.format("failed to delete existing timestamp file, dir=%s, file=%s",
directory, file));
log.warn("Failed to delete existing timestamp file, dir={}, file=}", directory,
file);
return false;
}
} catch (IOException e) {
log.error("Exception occurred when deleting {}", file.toString(), e);
return false;
}
}
return true;
@ -64,17 +79,26 @@ public abstract class TimestampFile {
return create(directory, name, System.currentTimeMillis());
}
@AcceptableCatchBlock(rationale = Rationale.CAN_NEVER_OCCUR)
public static long timestamp(File directory, final Event name) {
return timestamp(directory, name, true);
}
@AcceptableCatchBlock(rationale = Rationale.CAN_NEVER_OCCUR)
public static long timestamp(File directory, final Event name, boolean errorIfMissing) {
File[] files = directory
.listFiles((FileFilter) f -> f.getName().startsWith(name.toString()) && f.isFile());
if (files.length == 0) {
throw new PipelineException("Found zero files that match event:" + name);
if (!errorIfMissing) {
log.warn("Unable to find {} timestamp file in directory {}", name.toString(),
directory.toString());
return 0;
}
throw new PipelineException("Found zero files that match event: " + name);
}
if (files.length > 1) {
throw new PipelineException("Found more than one files that match event:" + name);
throw new PipelineException("Found more than one files that match event: " + name);
}
String filename = files[0].getName();
@ -113,10 +137,9 @@ public abstract class TimestampFile {
long finishTime = timestamp(directory, finishEvent);
if (startTime == -1 || finishTime == -1) {
// at least one of the events was missing or unparsable
log.warn(
String.format("Missing or invalid timestamp files, startTime=%s, finishTime=%s",
startTime, finishTime));
// At least one of the events was missing or unparsable.
log.warn("Missing or invalid timestamp files, startTime={}, finishTime={}", startTime,
finishTime);
return 0;
}
return finishTime - startTime;
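
The timestamp files handled above are named EVENT.millis (that is what pattern() matches), so an elapsed time is just the difference of two parsed suffixes; a self-contained sketch with made-up file names:

public class TimestampFileSketch {
    public static void main(String[] args) {
        // Hypothetical timestamp file names in the EVENT.millis format.
        String startFile = "SUBTASK_START.1730246400000";
        String finishFile = "SUBTASK_FINISH.1730246460000";
        long startTime = Long.parseLong(startFile.substring(startFile.indexOf('.') + 1));
        long finishTime = Long.parseLong(finishFile.substring(finishFile.indexOf('.') + 1));
        System.out.println("elapsed millis = " + (finishTime - startTime)); // 60000
    }
}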

View File

@ -32,10 +32,10 @@ public class WorkerMemoryManager {
private int availableMegaBytes;
public WorkerMemoryManager() {
MemInfo memInfo = OperatingSystemType.getInstance().getMemInfo();
MemInfo memInfo = OperatingSystemType.newInstance().getMemInfo();
long physicalMemoryMegaBytes = memInfo.getTotalMemoryKB() / KILO;
log.info("physicalMemoryMegaBytes: " + physicalMemoryMegaBytes);
log.info("physicalMemoryMegaBytes={}", physicalMemoryMegaBytes);
availableMegaBytes = (int) physicalMemoryMegaBytes;
long jvmMaxHeapMegaBytes = Runtime.getRuntime().maxMemory() / (KILO * KILO);
@ -47,8 +47,7 @@ public class WorkerMemoryManager {
*/
if (jvmMaxHeapMegaBytes < physicalMemoryMegaBytes / 2) {
log.info("JVM max heap size set to jvmMaxHeapMegaBytes: " + jvmMaxHeapMegaBytes);
log.info("JVM max heap size set to {}", jvmMaxHeapMegaBytes);
availableMegaBytes -= jvmMaxHeapMegaBytes;
} else {
/*
@ -60,8 +59,8 @@ public class WorkerMemoryManager {
long jvmInUseMegaBytes = Runtime.getRuntime().totalMemory() / (KILO * KILO);
if (jvmInUseMegaBytes < physicalMemoryMegaBytes / 2) {
log.info("JVM max heap size not available, using in-use bytes: jvmInUseMegaBytes: "
+ jvmInUseMegaBytes);
log.info("JVM max heap size not available, using in-use heap {}",
jvmInUseMegaBytes);
availableMegaBytes -= jvmInUseMegaBytes;
} else {
log.info("JVM heap size not available, not accounted for in pool");
@ -80,11 +79,10 @@ public class WorkerMemoryManager {
private void initSemaphore() {
memorySemaphore = new Semaphore(availableMegaBytes, true);
log.info("availableMegaBytes in memory manager pool: " + availableMegaBytes);
log.info("Memory manager pool has {} MB", availableMegaBytes);
}
/**
* @param megaBytes
* @see java.util.concurrent.Semaphore#acquire(int)
*/
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
@ -96,8 +94,7 @@ public class WorkerMemoryManager {
try {
memorySemaphore.acquire(megaBytes);
log.info(
megaBytes + " megabytes acquired, new pool size: " + availableMemoryMegaBytes());
log.info("Acquired {} MB, new pool size is {}", megaBytes, availableMemoryMegaBytes());
} catch (InterruptedException ignored) {
// If we got here, it means that a worker thread was waiting for Java heap to become
// available but that thread was interrupted. It is therefore no longer waiting for
@ -106,22 +103,18 @@ public class WorkerMemoryManager {
}
}
/**
* @param megaBytes
*/
private void logAcquirePrediction(int megaBytes) {
int numAvailPermits = memorySemaphore.availablePermits();
if (numAvailPermits < megaBytes) {
log.info("Requesting " + megaBytes + " megabytes from pool, but only " + numAvailPermits
+ " megabytes available, " + memorySemaphore.getQueueLength()
+ " threads already waiting (will probably block)...");
log.info(
"Requesting {} MB from pool, but only {} MB available, {} threads already waiting (will probably block)",
megaBytes, numAvailPermits, memorySemaphore.getQueueLength());
} else {
log.info("Requesting " + megaBytes + " megabytes from pool (probably won't block)...");
log.info("Requesting {} MB from pool (probably won't block)", megaBytes);
}
}
/**
* @param megaBytes
* @see java.util.concurrent.Semaphore#release(int)
*/
public void releaseMemoryMegaBytes(int megaBytes) {
@ -129,11 +122,11 @@ public class WorkerMemoryManager {
return;
}
log.info("Releasing " + megaBytes + " megabytes from pool...");
log.info("Releasing {} MB from pool", megaBytes);
memorySemaphore.release(megaBytes);
log.info(megaBytes + " megabytes released, new pool size: " + availableMemoryMegaBytes());
log.info("Released {} MB, new pool size is {} MB", megaBytes, availableMemoryMegaBytes());
}
/**

View File

@ -97,15 +97,15 @@ public class Hdf5ModuleInterface {
if (nOpen == 1) {
log.info("No unclosed HDF5 objects detected");
} else {
log.warn("Number of unclosed HDF5 objects detected: " + (nOpen - 1));
log.warn("Number of unclosed groups: "
+ H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_GROUP));
log.warn("Number of unclosed datasets: "
+ H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_DATASET));
log.warn("Number of unclosed datatypes: "
+ H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_DATATYPE));
log.warn("Number of unclosed attributes: "
+ H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_ATTR));
log.warn("Detected {} unclosed HDF5 objects", (nOpen - 1));
log.warn(" {} unclosed groups",
H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_GROUP));
log.warn(" {} unclosed datasets",
H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_DATASET));
log.warn(" {} unclosed datatypes",
H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_DATATYPE));
log.warn(" {} unclosed attributes",
H5.H5Fget_obj_count(fileId, HDF5Constants.H5F_OBJ_ATTR));
}
}

View File

@ -26,7 +26,7 @@ public class AlgorithmErrorReturn implements Persistable {
}
public void logStackTrace() {
log.error("Algorithm Stack Trace: msg=" + message + ", id=" + identifier);
log.error("Algorithm stack trace for msg={}, id={}", message, identifier);
for (AlgorithmStack stackFrame : stack) {
stackFrame.logStackTrace();
@ -65,8 +65,7 @@ public class AlgorithmErrorReturn implements Persistable {
private int line;
public void logStackTrace() {
log.error(
" Algorithm Stack Trace: file=" + file + ", name=" + name + ", line=" + line);
log.error(" Algorithm stack trace: file={}, name={}, line={}", file, name, line);
}
}
}

View File

@ -43,7 +43,7 @@ public class ModuleInterfaceUtils {
return;
}
String companionXmlFile = xmlFileName(moduleName);
log.info("Writing companion xml file \"" + companionXmlFile + "\".");
log.info("Writing companion xml file {}", companionXmlFile);
StringBuilder validationErrors = new StringBuilder();
try {
JAXBContext jaxbContext = JAXBContext.newInstance(inputs.getClass());
@ -130,7 +130,7 @@ public class ModuleInterfaceUtils {
private static void deleteErrorFile(File errorFileToDelete) {
boolean deleted = errorFileToDelete.delete();
if (!deleted) {
log.error("Failed to delete errorFile=" + errorFileToDelete);
log.error("Failed to delete errorFile {}", errorFileToDelete);
}
}
@ -143,7 +143,7 @@ public class ModuleInterfaceUtils {
}
/**
* Returns the name of the sub-task outputs file for a given module and sequence number.
* Returns the name of the subtask outputs file for a given module and sequence number.
*/
public static String outputsFileName(String moduleName) {
return moduleName + "-outputs." + BIN_FILE_TYPE;
@ -159,14 +159,14 @@ public class ModuleInterfaceUtils {
}
/**
* Returns the name of the sub-task error file for a given module and sequence number.
* Returns the name of the subtask error file for a given module and sequence number.
*/
public static String errorFileName(String moduleName) {
return moduleName + "-error." + BIN_FILE_TYPE;
}
/**
* Returns the name of the sub-task XML companion file for a given module and sequence number.
* Returns the name of the subtask XML companion file for a given module and sequence number.
*/
public static String xmlFileName(String moduleName) {
return moduleName + "-digest.xml";

View File

@ -63,7 +63,7 @@ public class MatlabUtils {
}
private static OperatingSystemType osType() {
return OperatingSystemType.getInstance();
return OperatingSystemType.newInstance();
}
private static String architecture() {

View File

@ -10,7 +10,7 @@
* <p>
* The processing steps found in the {@link gov.nasa.ziggy.pipeline.definition.ProcessingStep} enum
* are set with the
* {@link gov.nasa.ziggy.pipeline.definition.database.PipelineTaskOperations#updateProcessingStep(long, gov.nasa.ziggy.pipeline.definition.ProcessingStep)}
* {@link gov.nasa.ziggy.pipeline.definition.database.PipelineTaskDataOperations#updateProcessingStep(gov.nasa.ziggy.pipeline.definition.PipelineTask, gov.nasa.ziggy.pipeline.definition.ProcessingStep)}
* method and implicitly with the
* {@link gov.nasa.ziggy.pipeline.definition.PipelineModule#incrementProcessingStep()} method. Here
* is where each of these steps occur:

View File

@ -0,0 +1,112 @@
package gov.nasa.ziggy.module.remote;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.collections4.CollectionUtils;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import gov.nasa.ziggy.util.AcceptableCatchBlock;
import gov.nasa.ziggy.util.AcceptableCatchBlock.Rationale;
/**
* Extracts the exit comment and exit status from one or more PBS logs.
*
* @author PT
*/
public class PbsLogParser {
private static final Logger log = LoggerFactory.getLogger(PbsLogParser.class);
public static final String PBS_FILE_COMMENT_PREFIX = "=>> PBS: ";
public static final String PBS_FILE_STATUS_PREFIX = "Exit Status";
/**
* Extracts the exit comments from a collection of PBS logs and returns in a {@link Map} with
* job ID as the map key. Jobs with no comment will have no entry in the map.
*/
public Map<Long, String> exitCommentByJobId(
Collection<RemoteJobInformation> remoteJobsInformation) {
Map<Long, String> exitCommentByJobId = new HashMap<>();
if (CollectionUtils.isEmpty(remoteJobsInformation)) {
return exitCommentByJobId;
}
for (RemoteJobInformation remoteJobInformation : remoteJobsInformation) {
String exitComment = exitComment(remoteJobInformation);
if (!StringUtils.isBlank(exitComment)) {
exitCommentByJobId.put(remoteJobInformation.getJobId(), exitComment);
}
}
return exitCommentByJobId;
}
/** Returns the exit comment from a PBS log, or null if there is no exit comment. */
private String exitComment(RemoteJobInformation remoteJobInformation) {
List<String> pbsFileOutput = pbsLogFileContent(remoteJobInformation);
if (CollectionUtils.isEmpty(pbsFileOutput)) {
return null;
}
for (String pbsFileOutputLine : pbsFileOutput) {
log.debug("PBS file output line: {}", pbsFileOutputLine);
if (pbsFileOutputLine.startsWith(PBS_FILE_COMMENT_PREFIX)) {
return pbsFileOutputLine.substring(PBS_FILE_COMMENT_PREFIX.length());
}
}
return null;
}
/** Returns the content of a PBS log file as a {@link List} of {@link String}s. */
@AcceptableCatchBlock(rationale = Rationale.MUST_NOT_CRASH)
private List<String> pbsLogFileContent(RemoteJobInformation remoteJobInformation) {
try {
return Files.readAllLines(Paths.get(remoteJobInformation.getLogFile()));
} catch (IOException e) {
// If an exception occurred, we don't want to crash the algorithm monitor,
// so return null.
return null;
}
}
/**
* Extracts the exit status from a collection of PBS log files and returns as a {@link Map} with
* job ID as the map key. Jobs with no exit status will have no entry in the map.
*/
public Map<Long, Integer> exitStatusByJobId(
Collection<RemoteJobInformation> remoteJobsInformation) {
Map<Long, Integer> exitStatusByJobId = new HashMap<>();
if (CollectionUtils.isEmpty(remoteJobsInformation)) {
return exitStatusByJobId;
}
for (RemoteJobInformation remoteJobInformation : remoteJobsInformation) {
Integer exitStatus = exitStatus(remoteJobInformation);
if (exitStatus != null) {
exitStatusByJobId.put(remoteJobInformation.getJobId(), exitStatus);
}
}
return exitStatusByJobId;
}
/** Returns the exit status from a PBS log file, or null if there is no exit status. */
private Integer exitStatus(RemoteJobInformation remoteJobInformation) {
List<String> pbsFileOutput = pbsLogFileContent(remoteJobInformation);
if (CollectionUtils.isEmpty(pbsFileOutput)) {
return null;
}
for (String pbsFileOutputLine : pbsFileOutput) {
log.debug("PBS file output line: {}", pbsFileOutputLine);
if (pbsFileOutputLine.trim().startsWith(PBS_FILE_STATUS_PREFIX)) {
int colonLocation = pbsFileOutputLine.indexOf(":");
String returnStatusString = pbsFileOutputLine.substring(colonLocation + 1).trim();
return Integer.parseInt(returnStatusString);
}
}
return null;
}
}
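
A self-contained sketch of the two line formats PbsLogParser scans for; the example log lines are invented, not literal PBS output:

public class PbsLogLineSketch {
    public static void main(String[] args) {
        String commentLine = "=>> PBS: job killed: walltime exceeded"; // hypothetical comment line
        String statusLine = "  Exit Status: 271";                      // hypothetical status line

        String exitComment = commentLine.startsWith("=>> PBS: ")
            ? commentLine.substring("=>> PBS: ".length())
            : null;
        Integer exitStatus = statusLine.trim().startsWith("Exit Status")
            ? Integer.valueOf(statusLine.substring(statusLine.indexOf(":") + 1).trim())
            : null;
        System.out.println(exitComment + " / " + exitStatus);
    }
}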

Some files were not shown because too many files have changed in this diff.