Run jstack against Spark Driver process failed on MacOS M1 #5702

Open
zhouyifan279 opened this issue May 11, 2024 · 6 comments
Labels
bug (Something isn't working), triage

Comments

@zhouyifan279
Contributor

zhouyifan279 commented May 11, 2024

Backend

VL (Velox)

Bug description

Launch spark-sql in local mode and run jstack against it:

export gluten_jar=/Users/zhouyifan/git/incubator-gluten/package/target/gluten-velox-bundle-spark3.5_2.12-osx_14.4_aarch_64-1.2.0-SNAPSHOT.jar

./bin/spark-sql \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.driver.extraClassPath=${gluten_jar} \
  --conf spark.executor.extraClassPath=${gluten_jar} \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager

jstack exits with the following error message:

74551: Unable to open socket file /var/folders/yj/25xqj6_52n51xmctftgl_77c0000gn/T/.attach_pid74551: target process 74551 doesn't respond within 10500ms or HotSpot VM not loaded

Spark version

spark-3.5.1-bin-hadoop3

Spark configurations

No response

System information

JDK

openjdk version "1.8.0_402"
OpenJDK Runtime Environment (Zulu 8.76.0.17-CA-macos-aarch64) (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (Zulu 8.76.0.17-CA-macos-aarch64) (build 25.402-b06, mixed mode)

System

Velox System Info v0.0.2
Commit: 82e50ab196caff398013a3e76ca3b854a1156243
CMake Version: 3.29.3
System: Darwin-23.4.0
Arch: arm64
C++ Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
C++ Compiler Version: 15.0.0.15000309
C Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
C Compiler Version: 15.0.0.15000309
CMake Prefix Path: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr;/opt/homebrew;/usr/local;/usr;/;/opt/homebrew/Cellar/cmake/3.29.3;/usr/local;/usr/X11R6;/usr/pkg;/opt;/sw;/opt/local

Relevant logs

No response

@zhouyifan279 zhouyifan279 added the bug and triage labels May 11, 2024
@zhouyifan279 zhouyifan279 changed the title from "jstack failed on Spark Driver process running on MacOS M1" to "Run jstack against Spark Driver process failed on MacOS M1" May 11, 2024
@zhouyifan279
Contributor Author

zhouyifan279 commented May 11, 2024

Verified that:

  1. Oracle JDK 8 aarch64 (MacOS) has this bug.
  2. OpenJDK 17 aarch64 (MacOS) has this bug.
  3. OpenJDK 8 amd64 (Ubuntu) does not have this bug.

@zhouyifan279
Contributor Author

Adding the JVM option -XX:+StartAttachListener makes jstack work:

./bin/spark-sql \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.driver.extraClassPath=${gluten_jar} \
  --conf spark.executor.extraClassPath=${gluten_jar} \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.driver.extraJavaOptions=-XX:+StartAttachListener
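
To double-check that the option really reached a JVM, its startup arguments can be inspected from inside that JVM. The following is only a minimal sketch: the class name AttachListenerFlagCheck and the helper hasStartAttachListener are hypothetical names made up for illustration, and the check reports only what was passed on the command line, not whether the attach listener is actually running.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of Gluten or Spark.
public class AttachListenerFlagCheck {

  // Returns true if the last StartAttachListener switch in the argument list
  // enables it. When a flag is repeated, the later occurrence takes effect.
  static boolean hasStartAttachListener(List<String> jvmArgs) {
    boolean enabled = false;
    for (String arg : jvmArgs) {
      if (arg.equals("-XX:+StartAttachListener")) {
        enabled = true;
      } else if (arg.equals("-XX:-StartAttachListener")) {
        enabled = false;
      }
    }
    return enabled;
  }

  public static void main(String[] args) {
    // Arguments passed to this JVM, excluding the arguments to main().
    List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
    System.out.println("StartAttachListener requested: " + hasStartAttachListener(jvmArgs));
  }
}
```

Run standalone, for example with java -XX:+StartAttachListener AttachListenerFlagCheck, it only describes its own JVM. To check the Spark driver, the same call to getInputArguments() would have to run inside the driver process.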

@zhouyifan279
Contributor Author

According to this doc, jstack communicates with the JVM through a local socket file in the JVM's temporary directory, named .java_pid<pid>. The JVM creates the .java_pid<pid> file when it receives SIGQUIT.

I ran the following test cases and observed different behavior of the .java_pid<pid> file:

  1. With the JVM option -XX:+StartAttachListener specified, the .java_pid<pid> file is present as soon as the JVM starts.
  2. Without -XX:+StartAttachListener and with --conf spark.plugins=org.apache.gluten.GlutenPlugin removed, the .java_pid<pid> file is present after running jstack.
  3. Without -XX:+StartAttachListener and with --conf spark.plugins=org.apache.gluten.GlutenPlugin present, the .java_pid<pid> file is not present even after running jstack.
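
The three cases above can be checked more conveniently with a small helper that looks for the attach files of a given pid. This is a rough sketch under stated assumptions: the class name AttachFileCheck and its methods are hypothetical, and the directory to inspect is assumed to be the one shown in the jstack error message (the per-user temporary directory on macOS), which can be passed as a second argument if java.io.tmpdir differs.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the JDK, Gluten, or Spark.
public class AttachFileCheck {

  // Socket file the JVM creates once its attach listener is running.
  static Path socketFile(Path tmpDir, long pid) {
    return tmpDir.resolve(".java_pid" + pid);
  }

  // Trigger file the attaching tool (such as jstack) creates before signalling the JVM.
  static Path triggerFile(Path tmpDir, long pid) {
    return tmpDir.resolve(".attach_pid" + pid);
  }

  // One-line summary of which of the two files currently exist.
  static String describe(Path tmpDir, long pid) {
    return String.format(
        "socket=%s trigger=%s",
        Files.exists(socketFile(tmpDir, pid)),
        Files.exists(triggerFile(tmpDir, pid)));
  }

  public static void main(String[] args) {
    if (args.length < 1) {
      System.err.println("usage: java AttachFileCheck <pid> [tmpdir]");
      return;
    }
    long pid = Long.parseLong(args[0]);
    // Assumption: the target JVM uses the same temporary directory as this process.
    String dir = args.length > 1 ? args[1] : System.getProperty("java.io.tmpdir");
    System.out.println(describe(Paths.get(dir), pid));
  }
}
```

Running it before and after jstack for each case shows whether the socket file was ever created. In the failing case one would expect the socket file to stay absent.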

@zhouyifan279
Contributor Author

A simplified program can reproduce this bug:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

public class AttachListener {

  public static void main(String[] args) throws InterruptedException {
    // Load libvelox.dylib directly; no Spark session or Gluten plugin is started.
    File file = extractVeloxLibrary();
    System.load(file.getAbsolutePath());
    System.out.println("Library velox loaded");
    // Keep the process alive so that jstack can be run against it.
    Thread.sleep(Long.MAX_VALUE);
  }

  // Copies libvelox.dylib from the classpath into java.io.tmpdir and returns the copy.
  static File extractVeloxLibrary() {
    String tmpdir = System.getProperty("java.io.tmpdir");
    File file = new File(tmpdir, "libvelox.dylib");
    if (file.exists()) {
      file.delete();
    }
    try (InputStream is = AttachListener.class.getResourceAsStream("/libvelox.dylib");
         FileOutputStream fos = new FileOutputStream(file)) {
      // getResourceAsStream returns null when the resource is missing.
      if (is == null) {
        throw new IllegalStateException("libvelox.dylib not found on the classpath");
      }
      byte[] buffer = new byte[4096];
      int read;
      while ((read = is.read(buffer)) != -1) {
        fos.write(buffer, 0, read);
      }
    } catch (java.io.IOException e) {
      throw new RuntimeException("Failed to extract library", e);
    }
    return file;
  }
}

Compile and run:

javac AttachListener.java
java -cp "/path/to/incubator-gluten/package/target/gluten-velox-bundle-spark3.5_2.12-osx_14.4_aarch_64-1.2.0-SNAPSHOT.jar:/path/to/spark-3.5.1-bin-hadoop3/jars/*:." AttachListener

jstack also fails against the AttachListener process.

@zhouyifan279
Contributor Author

My guess is that loading libvelox.dylib interferes with the JVM's attach mechanism.
But I'm not a JVM expert and know little about libvelox.dylib, so I can't dig deeper to find the root cause.

The OpenJDK project has a similar issue, https://bugs.openjdk.org/browse/JDK-8235211, but it does not seem relevant.

@xumingming
Contributor

I am using macOS (Apple Silicon) with this JDK:

openjdk version "1.8.0_402"
OpenJDK Runtime Environment (Zulu 8.76.0.17-CA-macos-aarch64) (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (Zulu 8.76.0.17-CA-macos-aarch64) (build 25.402-b06, mixed mode)

and jstack works.
